Different approaches to reduce AWS S3 file upload time using AWS-SDK v3 in NodeJS.

Performance optimization is one of the key factors when designing systems. There are various performance implications in backend systems, and file uploading is one such factor.
There are plenty of posts out there that explain how to upload files to S3 buckets using the Node.js AWS SDK, but they do not cover all of the approaches available with the aws-sdk client. In this post I will describe three main approaches (and a few variations) by which you can upload files to an S3 bucket faster using Node.js and aws-sdk v3 while building a backend service.

1. Buffered upload.

This is the simplest and most common approach for uploading a file to an S3 bucket using aws-sdk v3.

import { PutObjectCommand, S3Client } from "@aws-sdk/client-s3";
import path from "node:path";
import fs from "node:fs";
import mime from "mime";

const s3Client = new S3Client({
    region: "s3-bucket-region",
    credentials: {
        accessKeyId: process.env.AWS_S3_ACCESS_KEY_ID,
        secretAccessKey: process.env.AWS_S3_ACCESS_KEY_SECRET,
    }
});

// Returns file as buffer to be uploaded to s3 bucket
function fileToBuffer(filePath) {
  return new Promise((resolve, reject) => {
      const fileStream = fs.createReadStream(filePath);
      const chunks = [];

      fileStream.on('data', (chunk) => {
          chunks.push(chunk);
      });

      fileStream.on('end', () => {
          const buffer = Buffer.concat(chunks);
          resolve(buffer);
          fileStream.destroy();
      });

      fileStream.on('error', (error) => {
          reject(error);
      });
  });
}

async function bufferedFileUpload() {
  const filePath = "<path with filename as string>";
  try {
    const objFile = await fileToBuffer(filePath);

    const s3UploadParams = {
      Bucket: process.env.AWS_S3_ACCESS_BUCKET_NAME,
      Key: `<path with filename as string in s3 bucket>`,
      Body: objFile,
      ContentType: mime.getType(filePath)
    };
    const putObjectCommand = new PutObjectCommand(s3UploadParams);
    console.time("uploading");
    const response = await s3Client.send(putObjectCommand);

    console.log(response);
    console.timeEnd("uploading");
  } catch(e) {
    console.error(e);
  } finally {
     s3Client.destroy();
  }
}


In this approach we first convert the file to a buffer using a Node.js file read stream, then use the S3Client object to prepare a PutObjectCommand and finally send it.
The biggest disadvantage of this approach is that the upload time increases as the file size grows, since the whole file is sent in a single request.
Sadly, even the object returned by the final send() does not include the S3 location URL string.
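
Since the response doesn't contain the object URL, you can construct it yourself. Below is a minimal sketch, assuming the standard virtual-hosted-style addressing (the buildS3ObjectUrl helper and the example key are my own illustrations, not part of the SDK):

// Builds the object URL manually, assuming virtual-hosted-style addressing:
// https://<bucket>.s3.<region>.amazonaws.com/<key>
function buildS3ObjectUrl(bucket, region, key) {
  // encode each key segment while keeping the "/" separators intact
  const encodedKey = key.split("/").map(encodeURIComponent).join("/");
  return `https://${bucket}.s3.${region}.amazonaws.com/${encodedKey}`;
}

// e.g. buildS3ObjectUrl(process.env.AWS_S3_ACCESS_BUCKET_NAME, "s3-bucket-region", "images/photo.png");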

2. Multipart upload.

import { AbortMultipartUploadCommand, CompleteMultipartUploadCommand, CreateMultipartUploadCommand, PutObjectCommand, S3Client, UploadPartCommand } from "@aws-sdk/client-s3";
import path from "node:path";
import fs from "node:fs";
import mime from "mime";

const s3Client = new S3Client({
    region: "s3-bucket-region",
    credentials: {
        accessKeyId: process.env.AWS_S3_ACCESS_KEY_ID,
        secretAccessKey: process.env.AWS_S3_ACCESS_KEY_SECRET,
    }
});

// Returns file as buffer to be uploaded to s3 bucket
function fileToBuffer(filePath) {
  return new Promise((resolve, reject) => {
      const fileStream = fs.createReadStream(filePath);
      const chunks = [];

      fileStream.on('data', (chunk) => {
          chunks.push(chunk);
      });

      fileStream.on('end', () => {
          const buffer = Buffer.concat(chunks);
          resolve(buffer);
          fileStream.destroy();
      });

      fileStream.on('error', (error) => {
          reject(error);
      });
  });
}

async function multipartUpload() {

  let uploadId;
  const filePath = "<path with filename as string>";
  const objFile = await fileToBuffer(filePath);
  const s3UploadParams = {
    Bucket: process.env.AWS_S3_ACCESS_BUCKET_NAME,
    Key: `<path with filename as string in s3 bucket>`,
    ContentType: mime.getType(filePath)
  };

  try {

    console.time("uploading");
    const multipartUpload = await s3Client.send(
      new CreateMultipartUploadCommand(s3UploadParams),
    );

    uploadId = multipartUpload.UploadId;

    const uploadPromises = [];
    // Multipart uploads require a minimum size of 5 MB per part.
    const partSize = 1024*1024*5;
    const parts = Math.ceil(objFile.length/partSize);

    for(let i = 0; i < parts; i++) {
      const start = i * partSize;
      const end = Math.min(start + partSize, objFile.length);
      uploadPromises.push(
        s3Client
          .send(
            new UploadPartCommand({
              ...s3UploadParams,
              UploadId: uploadId,
              Body: objFile.subarray(start, end),
              PartNumber: i + 1,
            }),
          )
          .then((d) => {
            console.log("Part", i + 1, "uploaded : ", d);
            return d;
          }),
      );
    }

    const uploadResults = await Promise.all(uploadPromises);

    let data = await s3Client.send(
      new CompleteMultipartUploadCommand({
        ...s3UploadParams,
        UploadId: uploadId,
        MultipartUpload: {
          Parts: uploadResults.map(({ ETag }, i) => ({
            ETag,
            PartNumber: i + 1,
          })),
        },
      }),
    );
    console.log("completed: ", data);
    console.timeEnd("uploading");
    return data;
  } catch (err) {
    console.error(err);

    if (uploadId) {
      const abortCommand = new AbortMultipartUploadCommand({
        ...s3UploadParams,
        UploadId: uploadId,
      });

      await s3Client.send(abortCommand);
    }
  } finally {
     s3Client.destroy();
  }
};

Using this approach we split the buffer into equal-sized chunks and upload each chunk asynchronously.

  • We request a multipart upload, which generates an upload ID.

  • Then we split the buffer into equal parts of 5 MB each and asynchronously upload each chunk, passing the uploadId.

  • Each part should be at least 5 MB as per the AWS SDK guideline (you can also choose a part size greater than 5 MB).

  • The last part may be smaller than 5 MB: since each chunk is a subarray of the buffer, we clamp the end offset to buffer.length whenever it would exceed the buffer's length (the sketch below illustrates this arithmetic).

  • Interestingly, even if the overall buffer (file) is smaller than 5 MB, we can still upload it as a single part that is less than 5 MB.

  • We wait for all chunk uploads to complete and capture the ETag and part number returned for each chunk.

  • Lastly, we send the complete-upload request, acknowledging the part number and corresponding ETag of each uploaded chunk; this finally returns an object with the S3 location URL and other metadata.

  • If any error occurs during a chunk upload after the multipart upload request has been made successfully, we can send an abort request using the specified uploadId.

As you may have noticed, this is a rather cumbersome approach 😅.
This approach is supported in aws-sdk v2 as well and is much faster than the 1st approach, but as the file size grows it becomes impractical to hold a large file in memory as a buffer.
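
To make the part-splitting arithmetic from the list above concrete, here is a minimal sketch (the toPartRanges helper is my own illustration, not part of the SDK) that computes the byte range handed to each UploadPartCommand:

// Splits a buffer into part descriptors for a multipart upload.
// Every part is `partSize` bytes, except possibly the last one,
// which is clamped to the end of the buffer.
const PART_SIZE = 1024 * 1024 * 5; // 5 MB minimum per part for S3 multipart uploads

function toPartRanges(buffer, partSize = PART_SIZE) {
  const parts = Math.ceil(buffer.length / partSize);
  return Array.from({ length: parts }, (_, i) => {
    const start = i * partSize;
    const end = Math.min(start + partSize, buffer.length);
    return { PartNumber: i + 1, start, end };
  });
}

// e.g. a 12 MB buffer yields three parts of 5 MB, 5 MB and 2 MB:
// toPartRanges(Buffer.alloc(12 * 1024 * 1024))
//   -> [{ PartNumber: 1, start: 0, end: 5242880 }, { PartNumber: 2, start: 5242880, end: 10485760 }, { PartNumber: 3, start: 10485760, end: 12582912 }]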

3. File Upload as Stream.

aws-sdk v3 also supports streaming files to an S3 bucket, which was not possible in older versions of the SDK. Using this approach we don't have to convert the file to a buffer before uploading. Instead we create a read stream (or a duplex stream) of the file and upload it using the @aws-sdk/lib-storage npm package.

import {S3Client} from "@aws-sdk/client-s3";
import path from "node:path";
import fs from "node:fs";
import { Upload } from "@aws-sdk/lib-storage";
import mime from "mime";

const s3Client = new S3Client({
    region: "s3-bucket-region",
    credentials: {
        accessKeyId: process.env.AWS_S3_ACCESS_KEY_ID,
        secretAccessKey: process.env.AWS_S3_ACCESS_KEY_SECRET,
    }
});

async function streamFileUpload() {
  const filePath = "<path with filename as string>";
  let fileReadStream;
  try {
    console.time("uploadling");
    fileReadStream = fs.createReadStream(filePath);
    const parallelUploads3 = new Upload({
      client: s3Client,
      params: {
        Bucket: process.env.AWS_S3_ACCESS_BUCKET_NAME,
        Key: `<path with filename as string in s3 bucket>`,
        Body: fileReadStream,
        ContentType: mime.getType(filePath)
      },

      tags: [
      ],
      queueSize: 4, // optional concurrency configuration
      partSize: 1024 * 1024 * 5, // optional size of each part, in bytes, at least 5MB
      leavePartsOnError: false, // optional manually handle dropped parts
    });

    parallelUploads3.on("httpUploadProgress", (progress) => {
      console.log(progress);
    });

    let data = await parallelUploads3.done();
    console.log(data);
    console.timeEnd("uploadling");

  } catch (e) {
     console.log(e);
  } finally {
     if(fileReadStream) fileReadStream.destroy();
     s3Client.destroy();
  }
}


Here we create a file read stream and pass the stream object to an Upload instance from @aws-sdk/lib-storage, which uploads the chunks asynchronously under the hood. We can even subscribe to progress updates by registering a callback on the httpUploadProgress event.

Just like the 2nd approach, the minimum part size here is 5 MB as well, and files smaller than 5 MB can still be uploaded.
Be careful with the queueSize parameter: increasing or decreasing it can affect the upload time, so tune it based on the average file size.

As you can see, this is far less cumbersome than the 2nd approach and has a smaller memory footprint as well.
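
As a small illustration of the httpUploadProgress callback mentioned above, here is a sketch that turns the progress events into a percentage when the total size is known (parallelUploads3 refers to the Upload instance from the example above; the log format is my own):

// progress.loaded is the number of bytes uploaded so far, while
// progress.total is only defined when the SDK can determine the
// overall size, so guard against it being undefined.
parallelUploads3.on("httpUploadProgress", (progress) => {
  if (progress.total) {
    const percent = ((progress.loaded / progress.total) * 100).toFixed(1);
    console.log(`uploaded ${percent}% (part ${progress.part ?? "?"})`);
  } else {
    console.log(`uploaded ${progress.loaded} bytes so far`);
  }
});

// The upload can also be cancelled mid-flight if needed:
// await parallelUploads3.abort();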


This is my 1st blog post on the DEV forum. Please leave feedback about any scope for improvement or any issues with this post, so that I can improve my writing skills and confidently share some unique topics with everyone 🤓.
