DEV Community


Automatically Delete Files From Amazon S3 Bucket With SubFolders Over A Duration Using Python

Richard Debrah
・10 min read

As per Amazon:

Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform, offering over 165 fully featured services from data centers globally. Millions of customers —including the fastest-growing startups, largest enterprises, and leading government agencies—trust AWS to power their infrastructure, become more agile, and lower costs.

One of the numerous services Amazon provides is the Simple Storage Service, popularly known as S3. Amazon S3 is a great way to store files for the short or the long term. Many institutions depend heavily on Amazon S3 (Cloudinary, for example, as of March 2019) to store files ranging from log files in txt format to my uncle's half-sister's father's grandma's .gif photos that my future son put together. You get what I mean.

It is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means institutions of all shapes and sizes can use it to store their data for all use cases. Amazon claims to offer competitive prices, but that is not what I will be writing about.
After a while, one will want to purge some if not all of the files stored on Amazon S3 for a number of reasons. These may include compliance (when an institution is required to retain certain data/files for a specified duration), saving space (Amazon charges you for the size of the files you keep on S3, so you are better off deleting unused files), or just for the heck of it.

Why don't you just go onto the S3 console and delete the files you want, or write a shell script to recursively remove them? Well, good luck doing that when you have over 2 million files saved up over a long period of time. Maybe my son's uncle's auntie's friend can help you.

I was faced with the task of removing some files from my company's Amazon S3 bucket, and below is how I went about it, with the help of @DavidOlinsky, who helped me clean it up. The AWS CLI was not of much help, as I needed control over what I delete and what I leave out. This code was written in Python. My assumed bucket structure, with file dates, is as follows:

├── [ 160 Jun 10 15:05]  my_s3_bucket
│   ├── [ 128 Jun 10 15:05]  level-one-folder1
│   │   └── [ 608 Jun 10 15:18]  another-sub-folder
│   │       ├── [   0 Mar 26  2015]  file1.jpg
│   │       ├── [   0 Mar 26  2015]  file10.jpg
│   │       ├── [   0 Mar 26  2015]  file2.jpg
│   │       ├── [   0 Mar 26  2015]  file3.jpg
│   │       ├── [   0 Mar 26  2015]  file4.jpg
│   │       ├── [   0 Mar 26  2015]  file5.jpg
│   │       ├── [   0 Mar 26  2015]  file6.jpg
│   │       ├── [   0 Mar 26  2015]  file7.jpg
│   │       ├── [   0 Mar 26  2015]  file8.jpg
│   │       ├── [   0 Mar 26  2015]  file9.jpg
│   │       ├── [ 416 Jun 10 15:19]  folder-inside-sub-folder
│   │       │   ├── [   0 Mar 26  2010]  culp1.test
│   │       │   ├── [   0 Mar 26  2010]  culp10.test
│   │       │   ├── [   0 Mar 26  2010]  culp2.test
│   │       │   ├── [   0 Mar 26  2010]  culp3.test
│   │       │   ├── [   0 Mar 26  2010]  culp4.test
│   │       │   ├── [   0 Mar 26  2010]  culp5.test
│   │       │   ├── [   0 Mar 26  2010]  culp6.test
│   │       │   ├── [   0 Mar 26  2010]  culp7.test
│   │       │   ├── [   0 Mar 26  2010]  culp8.test
│   │       │   └── [   0 Mar 26  2010]  culp9.test
│   │       ├── [   0 Jun 10 15:17]  newer1.config
│   │       ├── [   0 Jun 10 15:17]  newer2.config
│   │       ├── [   0 Jun 10 15:17]  newer3.config
│   │       ├── [   0 Jun 10 15:17]  newer4.config
│   │       └── [   0 Jun 10 15:17]  newer5.config
│   └── [ 384 Jun 10 15:35]  level-one-folder2
│       ├── [   0 Mar 26  2005]  old1.txt
│       ├── [   0 Mar 26  2005]  old10.txt
│       ├── [   0 Mar 26  2005]  old2.txt
│       ├── [   0 Mar 26  2005]  old3.txt
│       ├── [   0 Mar 26  2005]  old4.txt
│       ├── [   0 Mar 26  2005]  old5.txt
│       ├── [   0 Mar 26  2005]  old6.txt
│       ├── [   0 Mar 26  2005]  old7.txt
│       ├── [   0 Mar 26  2005]  old8.txt
│       └── [   0 Mar 26  2005]  old9.txt

Requirements

  1. boto3
  2. time
  3. sys

You can install boto3 by running pip install boto3 if you use pip, or conda install boto3, or by any other means you use to install Python modules (time and sys ship with Python, so only boto3 needs installing). Boto3 is Amazon's own Python library used to access their services. You can visit https://aws.amazon.com/ for all information regarding their libraries and services.

Imports

import boto3
import time
import sys

We are importing boto3 to be able to access our S3 services, time to help compute the checkpoint from the current time (which could be any set time), and sys to print error output to the terminal.

Let us set our variables. Note the placeholder values.

# today's epoch
_tday = time.time()
duration = 86400*180 #180 days in epoch seconds
#checkpoint for deletion
_expire_limit = _tday-duration
# initialize s3 client
s3_client = boto3.client('s3')
my_bucket = "my-s3-bucket"
my_ftp_key = "my-s3-key/"
_file_size = [] #just to keep track of the total savings in storage size
_del_size = 0 #running total (in MB) of deleted file sizes

The above should be straightforward with the comments, but in case you need more explanation you can continue reading this paragraph. You can look up the PEP 8 Style Guide for Python naming conventions if you are curious about the leading underscores. I am a student of this field, so kindly bear with me on my errors.

We can set a particular current time for our app to check from, but for the sake of the majority I will just use _tday as the current date.

duration is the UNIX epoch time span to check against (any file older than that will be removed). Here we are setting 180 days (six months) in UNIX epoch seconds.

_expire_limit is just the difference between the current date and the duration we need. Any file with a LastModified date less than this value is older and will be removed.
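To make the arithmetic concrete, here is a small sketch with a fixed epoch value standing in for time.time() (the numbers are made up for illustration):

```python
DAY = 86400  # seconds in one day

now = 1_700_000_000          # a fixed stand-in for time.time(), for repeatability
cutoff = now - DAY * 180     # _expire_limit: anything modified before this is expired

old_file_ts = now - DAY * 200   # a file last modified 200 days ago
new_file_ts = now - DAY * 30    # a file last modified 30 days ago

print(old_file_ts < cutoff)  # True  -> older than 180 days, would be deleted
print(new_file_ts < cutoff)  # False -> inside the retention window, kept
```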

s3_client will just be calling the low level Amazon s3 client for its methods/functions.

my_bucket will be the name of the bucket within which the files are contained. The files can be in subfolder(s) within the bucket. Don't worry, we will take care of that.

my_ftp_key will be the name of a particular subfolder within the bucket from which you want to search for the files. If you do not have subfolders, you can ignore any key that I mention (passed as Prefix).

We will use _file_size to hold the size in bytes of all the files that have been processed.

_del_size will update continuously as files are deleted and will help us determine the total size deleted even when an error occurs.

Next will be our functions.

Functions

#works to only get us key/file information
def get_key_info(bucket="my-s3-bucket", prefix="my-s3-key/"):

    print(f"Getting S3 Key Name, Size and LastModified from the Bucket: {bucket} with Prefix: {prefix}")

    key_names = []
    file_timestamp = []
    file_size = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        response = s3_client.list_objects_v2(**kwargs)
        for obj in response.get("Contents", []): # .get() avoids a KeyError when nothing matches the prefix
            # exclude directories/folder from results. Remove this if folders are to be removed too
            if "." in obj["Key"]:
                key_names.append(obj["Key"])
                file_timestamp.append(obj["LastModified"].timestamp())
                file_size.append(obj["Size"])
        try:
            kwargs["ContinuationToken"] = response["NextContinuationToken"]
        except KeyError:
            break

    key_info = {
        "key_path": key_names,
        "timestamp": file_timestamp,
        "size": file_size
    }
    print(f'All Keys in {bucket} with {prefix} Prefix found!')

    return key_info


# Check if date passed is older than date limit
def _check_expiration(key_date=_tday, limit=_expire_limit):
    if key_date < limit:
        return True


# connect to s3 and delete the file
def delete_s3_file(file_path, bucket=my_bucket):
    print(f"Deleting {file_path}")
    s3_client.delete_object(Bucket=bucket, Key=file_path)
    return True


# check size deleted
def _total_size_dltd(size):
    _file_size.append(size)
    _del_size = round(sum(_file_size)/1.049e+6, 2) #convert from bytes to mebibytes
    return _del_size

The get_key_info function takes in two parameters, a bucket name and a prefix, both of which are passed to the S3 client method called list_objects_v2. This method takes in a couple of arguments, one of which is ContinuationToken. The list_objects_v2 method is only able to return a maximum of 1000 records per call. Amazon employs a pagination method which returns a token we can use to request the next set of up to 1000 records in case you have a lot more files to go through.

For the sake of clarity, let us imagine having a book with about 10 pages. Every time you open a page you are able to see only what is on that particular page. You know there are more pages because at the bottom right there is a text that reads 2/10. Considering that the lines continue from page 1 to 10 and each page bears 10 lines, we know that page 2 will run from line 11 to 20 and page 5 from line 41 to 50.

This is how list_objects_v2 works. Each time we make a call it returns the next page's continuation token, so we can make the next call telling list_objects_v2 to continue from that point.

This is achieved here in a while loop where after each call we try to reset the ContinuationToken until we hit a blocker which will be a KeyError.

It is important to check the response syntax to know what kind of information you will require from this object. In our case, our file info lives within the "Contents" key.
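To see the token loop in isolation without calling AWS, here is a sketch where a hypothetical fake_list_objects stands in for list_objects_v2 and serves three canned "pages":

```python
# Stand-in pages mimicking list_objects_v2 responses: each page holds some
# "Contents" and, except for the last one, a NextContinuationToken.
PAGES = {
    None: {"Contents": [{"Key": "a.txt"}, {"Key": "b.txt"}], "NextContinuationToken": "t1"},
    "t1": {"Contents": [{"Key": "c.txt"}], "NextContinuationToken": "t2"},
    "t2": {"Contents": [{"Key": "d.txt"}]},  # last page: no token, loop ends
}

def fake_list_objects(ContinuationToken=None, **kwargs):
    # hypothetical stand-in for s3_client.list_objects_v2
    return PAGES[ContinuationToken]

keys = []
kwargs = {}
while True:
    response = fake_list_objects(**kwargs)
    keys.extend(obj["Key"] for obj in response.get("Contents", []))
    try:
        # same pattern as the article: missing token raises KeyError -> done
        kwargs["ContinuationToken"] = response["NextContinuationToken"]
    except KeyError:
        break

print(keys)  # ['a.txt', 'b.txt', 'c.txt', 'd.txt']
```

For what it's worth, boto3 also ships a built-in helper for this pattern: s3_client.get_paginator("list_objects_v2") iterates the pages for you.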

We only returned a key_info object since that function's job is to only return us the relevant info we need from the call.

_check_expiration takes in two parameters. If nothing is passed, it uses the default values from the variables above. It returns True for a file only when its LastModified date on S3 is older than the set duration.

delete_s3_file takes in file_path, which is the path of the file on S3 starting from the key (prefix). If no bucket name is passed, it defaults to my_bucket.

_total_size_dltd keeps a running total of the storage size of the deleted files in MB. If an error occurs at any point after files have been deleted, we will still be presented with the total size removed so far.
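The running-total behavior can be sketched in isolation; the names sizes and total_size_deleted below are hypothetical stand-ins for _file_size and _total_size_dltd:

```python
sizes = []  # accumulates the byte size of every deleted file

def total_size_deleted(size):
    # sketch of _total_size_dltd: append this file's bytes, return running total in MB
    sizes.append(size)
    return round(sum(sizes) / 1.049e+6, 2)  # 1.049e+6 approximates 2**20 bytes (one mebibyte)

print(total_size_deleted(524_288))  # 0.5  (half a mebibyte so far)
print(total_size_deleted(524_288))  # 1.0  (a running total, not a per-file size)
```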

Finally, we need to run our app. This app can be called/referenced from any other project that you have, as long as you understand what each part does. We will start by running our saved Python file directly with the following code:

if __name__ == "__main__":
    try:
        _del_size = 0 #so the final print works even when nothing gets deleted
        s3_file = get_key_info()
        for i, fs in enumerate(s3_file["timestamp"]):
            file_expired = _check_expiration(fs)
            if file_expired: #if True is received
                file_deleted = delete_s3_file(s3_file["key_path"][i])
                if file_deleted: #if file is deleted
                    _del_size = _total_size_dltd(s3_file["size"][i])

        print(f"Total File(s) Size Deleted: {_del_size} MB")
    except:
        print("failure:", sys.exc_info()[1])
        print(f"Total File(s) Size Deleted: {_del_size} MB")

In case your bucket name matches my_bucket and your prefix or key matches my_ftp_key, then running this without passing any parameters will run through the folder my_ftp_key inside my_bucket and remove any file older than 180 days from the time you run the app.
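One possible optimization if you have millions of keys: S3's delete_objects call accepts up to 1000 keys per request, so expired keys can be deleted in batches instead of one at a time. Below is a hedged sketch; chunks and batch_delete are hypothetical helpers, and the real API call is left commented out so the snippet runs without AWS credentials:

```python
def chunks(items, size=1000):
    # yield successive lists of at most `size` items
    for i in range(0, len(items), size):
        yield items[i:i + size]

def batch_delete(keys, bucket="my-s3-bucket"):
    # sketch only: each batch would map to one delete_objects request
    for batch in chunks(keys):
        payload = {"Objects": [{"Key": k} for k in batch]}
        # s3_client.delete_objects(Bucket=bucket, Delete=payload)  # the real call
        print(f"would delete {len(payload['Objects'])} keys from {bucket}")

batch_delete([f"file{i}.jpg" for i in range(2500)])
# would delete 1000 keys from my-s3-bucket
# would delete 1000 keys from my-s3-bucket
# would delete 500 keys from my-s3-bucket
```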

I will now demonstrate how to remove the old files from our assumed folder structure.

  1. If we decide to remove older files inside folder-inside-sub-folder only, this will be our call.
if __name__ == "__main__":
    try:
        _del_size = 0
        # the difference is right here: Prefix must be the full path from the bucket root
        s3_file = get_key_info("my_s3_bucket", "level-one-folder1/another-sub-folder/folder-inside-sub-folder")
        for i, fs in enumerate(s3_file["timestamp"]):
            # you can pass the duration in epoch as second parameter
            file_expired = _check_expiration(fs)
            if file_expired:
                file_deleted = delete_s3_file(s3_file["key_path"][i])
                if file_deleted:
                    _del_size = _total_size_dltd(s3_file["size"][i])

        print(f"Total File(s) Size Deleted: {_del_size} MB")
    except:
        print("failure:", sys.exc_info()[1])
        print(f"Total File(s) Size Deleted: {_del_size} MB")
  2. If we want to remove all old files from level-one-folder1, our call will be
if __name__ == "__main__":
    try:
        _del_size = 0
        # the difference is right here
        s3_file = get_key_info("my_s3_bucket", "level-one-folder1")
        for i, fs in enumerate(s3_file["timestamp"]):
            # you can pass the duration in epoch as second parameter
            file_expired = _check_expiration(fs)
            if file_expired:
                file_deleted = delete_s3_file(s3_file["key_path"][i])
                if file_deleted:
                    _del_size = _total_size_dltd(s3_file["size"][i])

        print(f"Total File(s) Size Deleted: {_del_size} MB")
    except:
        print("failure:", sys.exc_info()[1])
        print(f"Total File(s) Size Deleted: {_del_size} MB")

This will remove all older files inside another-sub-folder as well as folder-inside-sub-folder, since they are inside level-one-folder1. However, since we are checking for files older than 180 days, the files newer1.config to newer5.config inside another-sub-folder will not be touched, as they do not pass the expiration test.
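One detail worth stressing: the Prefix passed to list_objects_v2 is a plain string match against the start of each full key, so a nested folder must be referenced by its full path from the bucket root. A quick illustration with hypothetical keys:

```python
keys = [
    "level-one-folder1/another-sub-folder/file1.jpg",
    "level-one-folder1/another-sub-folder/folder-inside-sub-folder/culp1.test",
    "level-one-folder2/old1.txt",
]

# a full path from the bucket root matches the nested files
prefix = "level-one-folder1/another-sub-folder/folder-inside-sub-folder"
matches = [k for k in keys if k.startswith(prefix)]
print(matches)
# ['level-one-folder1/another-sub-folder/folder-inside-sub-folder/culp1.test']

# a bare folder name matches nothing, because no key starts with it
print([k for k in keys if k.startswith("folder-inside-sub-folder")])  # []
```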

I am sure you can think of many other ways to use this code to run deletions on your S3 bucket based on a duration, whether from a process script using argv or as a scheduled monthly maintenance job. Below is a summary of the full script.

import boto3
import time
import sys

# today's epoch
_tday = time.time()
duration = 86400*180 #180 days in epoch seconds
#checkpoint for deletion
_expire_limit = _tday-duration
# initialize s3 client
s3_client = boto3.client('s3')
my_bucket = "my-s3-bucket"
my_ftp_key = "my-s3-key/"
_file_size = [] #just to keep track of the total savings in storage size
_del_size = 0 #running total (in MB) of deleted file sizes

#Functions
#works to only get us key/file information
def get_key_info(bucket="my-s3-bucket", prefix="my-s3-key/"):

    print(f"Getting S3 Key Name, Size and LastModified from the Bucket: {bucket} with Prefix: {prefix}")

    key_names = []
    file_timestamp = []
    file_size = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        response = s3_client.list_objects_v2(**kwargs)
        for obj in response.get("Contents", []): # .get() avoids a KeyError when nothing matches the prefix
            # exclude directories/folder from results. Remove this if folders are to be removed too
            if "." in obj["Key"]:
                key_names.append(obj["Key"])
                file_timestamp.append(obj["LastModified"].timestamp())
                file_size.append(obj["Size"])
        try:
            kwargs["ContinuationToken"] = response["NextContinuationToken"]
        except KeyError:
            break

    key_info = {
        "key_path": key_names,
        "timestamp": file_timestamp,
        "size": file_size
    }
    print(f'All Keys in {bucket} with {prefix} Prefix found!')

    return key_info


# Check if date passed is older than date limit
def _check_expiration(key_date=_tday, limit=_expire_limit):
    if key_date < limit:
        return True


# connect to s3 and delete the file
def delete_s3_file(file_path, bucket=my_bucket):
    print(f"Deleting {file_path}")
    s3_client.delete_object(Bucket=bucket, Key=file_path)
    return True


# check size deleted
def _total_size_dltd(size):
    _file_size.append(size)
    _del_size = round(sum(_file_size)/1.049e+6, 2) #convert from bytes to mebibytes
    return _del_size


if __name__ == "__main__":
    try:
        _del_size = 0 #so the final print works even when nothing gets deleted
        s3_file = get_key_info()
        for i, fs in enumerate(s3_file["timestamp"]):
            file_expired = _check_expiration(fs)
            if file_expired: #if True is received
                file_deleted = delete_s3_file(s3_file["key_path"][i])
                if file_deleted: #if file is deleted
                    _del_size = _total_size_dltd(s3_file["size"][i])

        print(f"Total File(s) Size Deleted: {_del_size} MB")
    except:
        print("failure:", sys.exc_info()[1])
        print(f"Total File(s) Size Deleted: {_del_size} MB")

I hope I have been able to help you solve a problem you may be having. This tutorial is incomplete unless you critique it, as that makes me work hard to be better at what I do. Leave comments if you have to. You are free to copy/replicate the above for your use, but know that you use it at your own risk. I am not responsible for any damages whatsoever that may arise from your use, either to you or a third party.
