Refactoring PDFMerger and Adding Amazon S3 Feature

#career #softwareengineering #architecture #discuss

Why Refactoring?

Refactoring Guru defines it on its website: "Refactoring is a systematic process of improving code without creating new functionality that can transform a mess into clean code and simple design." I need to refactor my existing code because the business process and the web API live in a single project, and I want to extend it to support more storage providers. Previously, PDFMerger only supported Google Drive, and I want to use Amazon S3 so the process can be automated. After refactoring, I can swap the storage provider easily. If you want to know more about my project, the GitHub repository is open source. Feel free to give any feedback. Please see my initial changes in this PR.

Add S3 Functionality #7

berviantoleo posted on Feb 03, 2023


Refactoring Steps

The application has three steps:

Download the files from a folder and store them in memory
Merge the PDFs from the list of files in memory
Upload the merged PDF to the target directory

I moved those functions into a new project. The "interfaces", or contracts, live in one project and the integrations in another. For example, IDownloader provides the download function, IPdfMerger merges the PDFs, and IUploader provides the upload function. Besides those contracts, there is IMerge, which is implemented by the business process.
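To make the split between contracts and integrations concrete, here is a minimal sketch. Python is used only for illustration and is not taken from the repository. The class names mirror IDownloader, IPdfMerger, IUploader and IMerge, but the method names and signatures are my assumptions.

```python
from io import BytesIO
from typing import List, Protocol


class Downloader(Protocol):
    """Counterpart of IDownloader: fetch every file under a location."""

    def download(self, location: str) -> List[BytesIO]: ...


class PdfMerger(Protocol):
    """Counterpart of IPdfMerger: combine in-memory PDFs into one."""

    def merge(self, files: List[BytesIO]) -> BytesIO: ...


class Uploader(Protocol):
    """Counterpart of IUploader: store the merged file at a target."""

    def upload(self, content: BytesIO, target: str) -> None: ...


class MergeProcess:
    """Counterpart of the business process behind IMerge (hypothetical shape).

    It depends only on the contracts, so switching from Google Drive to
    Amazon S3 means passing different objects to the constructor.
    """

    def __init__(self, downloader: Downloader, merger: PdfMerger, uploader: Uploader) -> None:
        self._downloader = downloader
        self._merger = merger
        self._uploader = uploader

    def run(self, source: str, target: str) -> None:
        files = self._downloader.download(source)   # step 1: download
        merged = self._merger.merge(files)          # step 2: merge
        self._uploader.upload(merged, target)       # step 3: upload
```

The business process never imports a storage SDK, which is what makes the provider easy to replace.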

My previous application downloaded the merged PDF to the local computer and required an OAuth 2.0 login. That login step is what kept me from automating the process.

I decided to create a new console project that uses Amazon S3 to read from and write to a bucket. Honestly, it was quite challenging because I also added functionality on top of the refactoring.

Next, I'm going to add concurrent downloads. Currently, the application downloads the files sequentially.
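One possible shape for the concurrent version is a thread pool, sketched here in Python for illustration only. The function name `download_all`, the `fetch` callback and the default of four workers are all assumptions, not part of the project.

```python
from concurrent.futures import ThreadPoolExecutor
from io import BytesIO
from typing import Callable, List, Sequence


def download_all(
    keys: Sequence[str],
    fetch: Callable[[str], bytes],
    max_workers: int = 4,
) -> List[BytesIO]:
    """Fetch every key on a thread pool, keeping the input order."""
    if not keys:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `keys`, so the
        # order of files handed to the merge step stays predictable.
        return [BytesIO(data) for data in pool.map(fetch, keys)]
```

Threads fit here because the work is waiting on the network rather than using the CPU.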

More Details

Let's focus on the Amazon S3 implementation. I use the AWS SDK to integrate with Amazon S3. For the download process, the application first lists all files (the file id, or key) with the ListObjectsV2 API. I only provide BucketName and Prefix, where the Prefix is the folder location. Since the results may span more than one page, I keep requesting pages with the NextContinuationToken until every file has been listed.
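The pagination loop can be sketched as follows. This is a Python illustration written against the boto3 S3 client's `list_objects_v2` call, not the project's own code. The client is passed in as a parameter, and the function name `list_keys` is an assumption.

```python
from typing import Any, List


def list_keys(client: Any, bucket: str, prefix: str) -> List[str]:
    """Collect every object key under `prefix`, following all pages."""
    keys: List[str] = []
    token = None
    while True:
        params = {"Bucket": bucket, "Prefix": prefix}
        if token:
            # Resume where the previous page stopped.
            params["ContinuationToken"] = token
        page = client.list_objects_v2(**params)
        # "Contents" is absent when a page has no matching objects.
        keys.extend(item["Key"] for item in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        token = page["NextContinuationToken"]
```

With boto3 installed, `list_keys(boto3.client("s3"), "my-example-bucket", "input/")` would be a typical call; the bucket and prefix there are placeholders.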

After getting the list, the application downloads each file using the GetObject API and stores each one as a MemoryStream in a list.
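As a rough Python equivalent, assuming a boto3-style client passed in as a parameter and using `BytesIO` in place of MemoryStream (the function name `download_objects` is hypothetical):

```python
from io import BytesIO
from typing import Any, Iterable, List


def download_objects(client: Any, bucket: str, keys: Iterable[str]) -> List[BytesIO]:
    """Read each object fully into memory, one BytesIO per key."""
    streams: List[BytesIO] = []
    for key in keys:
        response = client.get_object(Bucket=bucket, Key=key)
        # "Body" is a streaming object; read() pulls the whole payload.
        streams.append(BytesIO(response["Body"].read()))
    return streams
```

Holding everything in memory keeps the code simple, but it does mean the total size of the PDFs has to fit in RAM.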

How about the Upload? It's simple. The application uses the PutObject API to upload the MemoryStream.
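A minimal Python sketch of that upload, again assuming a boto3-style client is passed in. The function name `upload_stream` and the `application/pdf` content type are my assumptions.

```python
from io import BytesIO
from typing import Any


def upload_stream(client: Any, bucket: str, key: str, content: BytesIO) -> None:
    """Upload an in-memory stream as a single object."""
    # Rewind first: a stream that was just written to sits at its end,
    # and an upload from that position would send zero bytes.
    content.seek(0)
    client.put_object(
        Bucket=bucket,
        Key=key,
        Body=content,
        ContentType="application/pdf",
    )
```

The rewind is the one detail that is easy to miss when the merged PDF is produced and uploaded in the same run.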

So, what do you think? Again, if you have any feedback, feel free to share it in the comment section.

Permissions

I created a new policy with these permissions to cover what the application needs.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:PutObjectRetention",
                "s3:GetObjectVersionTagging",
                "s3:GetObjectAttributes",
                "s3:GetObjectTagging",
                "s3:ListBucket",
                "s3:GetObjectVersionAttributes",
                "s3:GetObjectVersion"
            ],
            "Resource": "*"
        }
    ]
}
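The three API calls described in this post (ListObjectsV2, GetObject, PutObject) map to the s3:ListBucket, s3:GetObject and s3:PutObject actions. If the application needs nothing else, a narrower policy scoped to a single bucket might look like the output of this Python sketch. The bucket name is a placeholder, and whether the versioning, tagging and retention actions above can be dropped is an assumption to verify against your own setup.

```python
import json
from typing import Any, Dict


def scoped_policy(bucket: str) -> Dict[str, Any]:
    """Build a policy limited to one bucket instead of Resource "*"."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading and writing are object-level actions.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# "my-example-bucket" is a hypothetical name used only for the demo.
print(json.dumps(scoped_policy("my-example-bucket"), indent=4))
```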

Cron Functions

For now, I use GitHub Actions as the cron function. If you want to see how the cron works in GitHub Actions, feel free to visit this page and the source code. Later, I'm going to move the cron function to AWS: I plan to use Lambda with container images and trigger the Lambda using Amazon EventBridge. If you have any recommendations, feel free to comment on this post.

Thank you

Thank you for reading. I won't dive deep into the code here. I just want to share my experience and the migration process. Rewriting all the code in this post would be overwhelming.
