Mitansh Gor for Distinction Dev
S3 Multi-Part Upload: Part 2 Conclusion


Howdy, my tech-savvy pals! 🌟 Remember our last rendezvous? We chatted about the multipart upload basics - the whole shebang! Today, get ready to roll up your sleeves because we're plunging into the deep end! 💦 We're talking all about those tricky low-level hurdles you might encounter while playing around with multipart stuff using Node.js + the Serverless framework. But hey, fear not! We're a dynamic duo, and together, we'll smash these challenges and soar to victory! 💪🚀

🌟 Remember our chat about multipart upload (Part 1)? It's like a three-step dance for your data! First, you start the upload party. Then, you groove through uploading the object parts. Finally, when all the parts are in place, you wrap up the multipart upload! 🚀📦


But before we take the plunge into each step's deep end, let's first secure our multipart permission set. Ready, set, go! 🏁

# serverless.yml
provider:
  ....
  iamRoleStatements:
    ......
    - Effect: Allow
      Action:
        - s3:GetObject
        - s3:PutObject
        - s3:AbortMultipartUpload
      Resource:
        - arn:aws:s3:::${Bucket}/*
    - Effect: Allow
      Action:
        - s3:ListBucketMultipartUploads
      Resource:
        - arn:aws:s3:::${Bucket}

🛠️ "PERMISSIONS ADDED!" 🛠️

Now that the permissions are in check, nothing's holding us back from diving deep into each thrilling step of the multipart saga! 💪🌊


Multipart Upload Initiation


Let's dive right into the exciting world of Multipart Upload Initiation with this code snippet! This magical piece of code helps us kickstart the multipart upload process for an S3 bucket and key. Remember, it's crucial to save the response (upload ID) this function gives us. This upload ID is like the secret key to the opened stream, and we'll need it for the rest of our multipart adventure! 🗝️✨

// AWS SDK v2 client, shared by all of the helpers in this post
const AWS = require('aws-sdk')
const S3 = new AWS.S3()

/**
 * Create a multipart upload for a given S3 bucket and key.
 *
 * @param {string} bucket - The S3 bucket name.
 * @param {string} key - The S3 object key.
 * @returns {Promise<string>} The upload ID for the multipart upload.
 * @throws {Error} If there's an issue with the multipart upload creation.
 */
const createMultipartUpload = async (bucket, key) => {
  if (typeof bucket !== 'string' || !bucket.trim()) {
    throw new Error('Invalid bucket name. Please provide a valid string value for the bucket.')
  }

  if (typeof key !== 'string' || !key.trim()) {
    throw new Error('Invalid object key. Please provide a valid string value for the key.')
  }

  try {
    const params = {
      Bucket: bucket,
      Key: key
    }

    // Use S3's createMultipartUpload with promise()
    const data = await S3.createMultipartUpload(params).promise()

    return data.UploadId
  } catch (error) {
    throw new Error(`Error creating multipart upload: ${error.message}`)
  }
}

Uploading Parts - The 'Part-y' Begins


Uploading chunks or parts might seem like a breeze, but here's the catch: multipart upload has its own set of rules. The big one? Every part except the last must be at least 5 MiB (and no single part can exceed 5 GiB). But sometimes we run into situations where certain chunks/parts come in under that 5 MiB floor. 🤔😅

Handling and validating these scenarios is crucial. Imagine our multipart adventure running into these unexpectedly small chunks - we need to be ready to address them! 💡📦💻

To tackle this challenge, we're diving into 'chunk mode'. Picture this: we've got an array of files fetched from S3 along with their sizes. Now we're slicing and dicing this array into chunks, making sure each chunk carries at least 5 MiB of data. 🎲✂️
Check out the magic of chunkifying:


// Below is the chunked result of the file metadata fetched from S3.
// Grouped entries come back as { content: [...], size } objects, matching the
// convertToChunks helper shown a bit further down; standalone big files stay as plain metadata.
chunkifyArray = [
  {
    content: [
      { fileName: 'A.csv', ContentLength: 1728033 },  // 1.65 MiB
      { fileName: 'B.csv', ContentLength: 53326970 }, // 50.86 MiB
    ],
    size: 55055003, // 52.51 MiB total
  },
  { fileName: 'C.csv', ContentLength: 21646619 },     // 20.64 MiB
  {
    content: [
      { fileName: 'D.csv', ContentLength: 1728033 },  // 1.65 MiB
      { fileName: 'E.csv', ContentLength: 5226970 },  // 4.98 MiB
    ],
    size: 6955003, // 6.63 MiB total
  },
]

See how we cleverly grouped those files into chunks? It's like solving a puzzle! 🧩💻 This way, we ensure every chunk (except possibly the last) meets the 5 MiB minimum.

So, picture this: file A, at roughly 1.65 MiB, is too small to be a part on its own, so file B gets bundled with it and together they sail past the mark. Meanwhile, file C is a lone ranger, confidently exceeding the 5 MiB mark all by itself. Then we've got files D and E, best buddies who are each under 5 MiB individually, teaming up to go beyond that limit together.

This clever strategy ensures our chunks are just the right size for this multipart upload adventure! 🚀🔍

Now, for files A and B, we're planning a little readStream party! 🎉📚 We'll grab the records from both files, blend them into one mighty string, and that fusion will become the uploadable part. Think of it as a superhero team-up! 💪🦸‍♂️ The same goes for the dynamic duo, files D and E.

But hey, file C is a solo act. We'll simply read its data and smoothly upload it via stream. 🌟💾
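
The snippets further down lean on a helper called fetchCombinedRecordsFromMultipleFileObjects without showing it, so here's a minimal sketch of what that merge could look like. Treat it as an assumption-laden illustration: it presumes fileName is the full S3 object key, that a chunk's files comfortably fit in memory, and that plain concatenation is the right way to combine them (for CSVs you may need to strip repeated headers).

/**
 * Hypothetical merge helper: reads every file in a chunk from S3 and
 * concatenates the bodies into a single Buffer that can be uploaded as one part.
 *
 * @param {Object} chunk - Either a bare metadata object (big file) or a { content, size } group.
 * @param {string} bucket - The S3 bucket name.
 * @returns {Promise<Buffer>} The combined body for this part.
 */
const fetchCombinedRecordsFromMultipleFileObjects = async (chunk, bucket) => {
  // Standalone chunks are plain metadata objects; grouped chunks carry a `content` array.
  const files = chunk.content ? chunk.content : [chunk]

  const bodies = []
  for (const file of files) {
    // Assumes fileName is the object's key; adjust if your keys carry a prefix.
    const obj = await S3.getObject({ Bucket: bucket, Key: file.fileName }).promise()
    bodies.push(obj.Body)
  }

  return Buffer.concat(bodies)
}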

Imagine this snippet as our trusty guide to converting an array of files' metadata into chunks. Buckle up, we're diving into some code magic! ✨🚀


/**
 * Converts an array of file metadata into chunks based on a size threshold.
 *
 * Files at or above the threshold become standalone chunks; smaller files are
 * grouped together until their combined size crosses the threshold.
 *
 * @param {Object[]} data - Array of file metadata objects (with ContentLength).
 * @param {number} THRESHOLDLIMIT_5MB - Size threshold for chunking, in bytes.
 * @returns {{ chunkifyArray: Object[], totalSize: number }} The chunks and the total size of all files.
 */
const convertToChunks = (data, THRESHOLDLIMIT_5MB) => {
  const chunkifyArray = [];
  let totalSize = 0;

  data.forEach((file) => {
    if (!file.ContentLength) return;

    totalSize += file.ContentLength;

    const lastChunk = chunkifyArray[chunkifyArray.length - 1];

    if (
      chunkifyArray.length === 0 ||
      lastChunk.size === undefined || // last chunk is a standalone (big) file
      lastChunk.size > THRESHOLDLIMIT_5MB // last grouped chunk is already big enough
    ) {
      // Start a new chunk: big files stand alone, small files open a new group
      if (file.ContentLength >= THRESHOLDLIMIT_5MB) {
        chunkifyArray.push(file);
      } else {
        chunkifyArray.push({ content: [file], size: file.ContentLength });
      }
    } else {
      // The last grouped chunk is still under the threshold - keep filling it
      lastChunk.content.push(file);
      lastChunk.size += file.ContentLength;
    }
  });

  return { chunkifyArray, totalSize };
};

Now that we've prepared our readStream data from individual files, it's time for the grand finale of this step: uploading each chunk or part to our multipart stream. Enter our superhero function, uploadMultiPartHelper! 💪📤

/**
 * Uploads a part of a multipart upload to an S3 bucket.
 *
 * @param {Buffer | Uint8Array | string} body - The content of the part to upload.
 * @param {string} bucket - The name of the S3 bucket.
 * @param {string} key - The key (path) where the part will be stored in the bucket.
 * @param {number} partNumber - The part number for the multipart upload.
 * @param {string} uploadId - The ID of the multipart upload.
 * @returns {object} - The ETag and partNumber of the uploaded part.
 * @throws {Error} - If any validation or upload error occurs.
 */
const uploadMultiPartHelper = async (body, bucket, key, partNumber, uploadId) => {
  try {
    const params = {
      Body: body,
      Bucket: bucket,
      Key: key,
      PartNumber: partNumber,
      UploadId: uploadId
    }
    const data = await S3.uploadPart(params).promise()
    return {
      ETag: data.ETag,
      PartNumber: partNumber
    }
  } catch (error) {
    throw new Error(`Upload failed: ${error.message}`)
  }
}


With this uploadMultiPartHelper function ready to roll, our multipart upload strategy is almost complete! 🎉 But wait, there's a twist in the tale! What if the total size of all our files doesn't exceed the 5 MiB mark? 🤔 Let's tackle that scenario head-on with another code snippet:

/**
 * Chunks the file metadata and writes every chunk into the multipart stream.
 *
 * @param {Object[]} initialArrayOfFileMetaData - Array of file metadata objects.
 * @param {number} THRESHOLDLIMIT_5MB - Size threshold for chunking, in bytes.
 * @returns {Promise<Object[]|undefined>} The uploaded parts ({ ETag, PartNumber }) when multipart is used.
 */
const chunkDataWriteIntoStream = async (initialArrayOfFileMetaData, THRESHOLDLIMIT_5MB) => {
  const { chunkifyArray, totalSize } = convertToChunks(initialArrayOfFileMetaData, THRESHOLDLIMIT_5MB);

  if (totalSize < THRESHOLDLIMIT_5MB) {
    // If we're under the 5 MiB mark, skip multipart entirely:
    // combine all the files into one body and upload it with a plain s3.upload().
  } else {
    // The total size is bigger than 5 MiB, so handle it with multipart chunk uploads.
    const uploadId = await createMultipartUpload(bucket, key); // bucket and key come from the surrounding scope
    const respArr = [];

    for (let i = 0; i < chunkifyArray.length; i++) {
      const partNumber = i + 1;
      // Read and combine the records of the files in this chunk (see the merge sketch earlier).
      const body = await fetchCombinedRecordsFromMultipleFileObjects(chunkifyArray[i], bucket);
      // Push this chunk into the multipart stream for the given uploadId and part number.
      const uploadResponse = await uploadMultiPartHelper(body, bucket, key, partNumber, uploadId);
      respArr.push(uploadResponse);
    }

    return respArr;
    // respArr = [
    //   { ETag: '"..."', PartNumber: 1 },
    //   { ETag: '"..."', PartNumber: 2 },
    //   ...
    // ]
  }
};
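For completeness, here's one way the under-5-MiB branch could be filled in - a hedged sketch, not the post's exact implementation. It reuses the hypothetical merge helper from earlier and falls back to a single s3.upload() call, since a payload that small never needs multipart at all.

/**
 * Possible shape of the small-payload fallback: combine everything and do one regular upload.
 * (Sketch only - wire the bucket/key handling to match the rest of your handler.)
 */
const uploadCombinedSmallPayload = async (chunkifyArray, bucket, key) => {
  const bodies = []
  for (const chunk of chunkifyArray) {
    bodies.push(await fetchCombinedRecordsFromMultipleFileObjects(chunk, bucket))
  }

  // One plain upload - no uploadId, no parts, no completion step needed.
  const data = await S3.upload({ Bucket: bucket, Key: key, Body: Buffer.concat(bodies) }).promise()
  return data.Location
}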

Turbocharging Processing Speed


Hey, so you know that loop (for (let i = 0; i < chunkifyArray.length; i++)) we've got above, running through our chunks one by one? 🔄 In tech terms, it's a bit of a slowpoke when we're in a rush, especially with time limits like the 15-minute cap on Lambda functions. ⏳

But guess what? We've got a secret recipe to speed things up! 🌟✨

Ingredient 1: Let's chop our chunkifyArray into smaller batches and use the power of promises to run each batch's uploads at the same time! 🎉🔪🚀 Imagine it like a well-coordinated dance where multiple chunks perform their tasks simultaneously.
But in the world of Lambda functions, there's a limit on how much they can handle within that 15-minute timeframe. Based on some real-world testing and tinkering, a Lambda typically gets through around 95 to 100 MB of files within that span. 🕒📏
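
Here's a rough sketch of Ingredient 1, assuming the helpers shown earlier. The BATCH_SIZE value is a made-up knob - tune it against your Lambda's memory and network headroom.

// Upload the parts in small concurrent batches with Promise.all.
const BATCH_SIZE = 4 // assumption: how many parts to upload at once

const uploadChunksInBatches = async (chunkifyArray, bucket, key, uploadId) => {
  const respArr = []

  for (let i = 0; i < chunkifyArray.length; i += BATCH_SIZE) {
    const batch = chunkifyArray.slice(i, i + BATCH_SIZE)

    // All parts in this batch upload in parallel; batches still run one after another.
    const results = await Promise.all(
      batch.map(async (chunk, j) => {
        const partNumber = i + j + 1 // part numbers stay unique and in order
        const body = await fetchCombinedRecordsFromMultipleFileObjects(chunk, bucket)
        return uploadMultiPartHelper(body, bucket, key, partNumber, uploadId)
      })
    )

    respArr.push(...results)
  }

  return respArr // [{ ETag, PartNumber }, ...]
}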

Now, imagine this: what if we've got larger files than that? 🤯📦 That's where Lambda might start feeling a bit overwhelmed, like trying to fit an elephant through a mouse hole! 🐘🕳️

Ingredient 2: Now, here's the cherry on top! Instead of relying on Lambda's time constraints, let's switch gears and implement this in Step Functions. It's like upgrading to a turbocharged engine for processing! 🏎️💨 By using Step Functions' Map state and iterating through each chunk in parallel, we're hitting the fast lane! And to keep that parallelism under control (or crank it up), we can tune the MaxConcurrency setting while configuring the Map state! 🌪️🔥🌐
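
To keep the code samples in Node.js, here's a hedged sketch of how that Map-state fan-out could be wired from the Lambda side: one function prepares the chunk list and opens the multipart stream, and a second one uploads a single part per Map iteration. The event shapes (files, parts, and so on) are assumptions, not an official contract, and the Map state itself - with its MaxConcurrency - lives in your state machine definition.

// Hedged sketch of the two Lambda handlers a Map state could fan out over.

// 1) "Prepare" Lambda: builds the chunk list and opens the multipart stream.
module.exports.prepareChunks = async ({ bucket, key, files }) => {
  const THRESHOLDLIMIT_5MB = 5 * 1024 * 1024
  const { chunkifyArray } = convertToChunks(files, THRESHOLDLIMIT_5MB)
  const uploadId = await createMultipartUpload(bucket, key)

  // The Map state iterates over `parts`; each item becomes one uploadOnePart invocation.
  return {
    uploadId,
    parts: chunkifyArray.map((chunk, i) => ({ bucket, key, uploadId, partNumber: i + 1, chunk }))
  }
}

// 2) "Worker" Lambda: uploads a single part; MaxConcurrency on the Map state caps the fan-out.
module.exports.uploadOnePart = async ({ bucket, key, uploadId, partNumber, chunk }) => {
  const body = await fetchCombinedRecordsFromMultipleFileObjects(chunk, bucket)
  return uploadMultiPartHelper(body, bucket, key, partNumber, uploadId) // { ETag, PartNumber }
}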


With this added perspective, we're preparing a recipe for success that considers all the ingredients and ensures we handle any file size without breaking a sweat! 🌟🚀🔍


The Grande Finale - Completing the Multipart Upload


After uploading every relevant part, it's showtime! We call in the big guns with the "Complete Multipart Upload" action. 🎉

Here's where the magic happens: Amazon S3 takes all those parts, arranges them in ascending order by part number, and voilà! 🎩✨ A brand new object is born! 🌟🧬 It's like assembling the Avengers - each part plays a vital role in creating the ultimate superhero object!

But wait, there's a catch! 🤔📏 Your proposed upload should be larger than the minimum allowed object size. Each part has to be at least 5 MiB in size, except for the very last part. It's like ensuring each puzzle piece is big enough, fitting the puzzle guidelines! 🧩📏

Now, let's dive into this helper function, the secret sauce that makes all of this possible! 🍝✨

/**
 * Completes a multipart upload to an S3 bucket and returns the uploaded object's location.
 *
 * @param {string} bucket - The name of the S3 bucket.
 * @param {string} key - The key or path for the uploaded object.
 * @param {Array<{ ETag: string, PartNumber: number }>} partArray - An array of parts with ETag and PartNumber.
 * @param {string} uploadId - The unique upload identifier for the multipart upload.
 * @returns {Promise<string>} A Promise that resolves to the uploaded object's location.
 * @throws {Error} Throws an error if the upload fails.
 */
const completeMultiPartUpload = async (bucket, key, partArray, uploadId) => {
  try {
    const params = {
      Bucket: bucket,
      Key: key,
      MultipartUpload: {
        Parts: partArray
      },
      UploadId: uploadId
    }

    const data = await S3.completeMultipartUpload(params).promise()
    return data.Location
  } catch (error) {
    throw new Error(error.message)
  }
}
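To see how the pieces fit together inside a single Lambda, here's a hedged end-to-end sketch under the same assumptions as the earlier snippets. It reuses the batching helper from the Ingredient 1 sketch, and the abortMultiPartHelper that's introduced in the next section.

// Rough end-to-end flow: create the upload, push the parts, complete - or abort on failure.
const runMultipartUpload = async (bucket, key, initialArrayOfFileMetaData) => {
  const THRESHOLDLIMIT_5MB = 5 * 1024 * 1024
  const { chunkifyArray } = convertToChunks(initialArrayOfFileMetaData, THRESHOLDLIMIT_5MB)

  const uploadId = await createMultipartUpload(bucket, key)
  try {
    const parts = await uploadChunksInBatches(chunkifyArray, bucket, key, uploadId)
    return await completeMultiPartUpload(bucket, key, parts, uploadId)
  } catch (error) {
    // Close the stream on failure so orphaned parts don't keep costing money
    // (abortMultiPartHelper is covered in the next section).
    await abortMultiPartHelper(bucket, key, uploadId)
    throw error
  }
}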

Aborting Multipart Uploads: When Things Go Awry


Ever wondered what happens if an error sneaks into the multipart process? Money talks, and in this case, it's about those unwanted charges! 💰💸


Here's the deal: if an error occurs midway through the multipart process, the multipart stream remains open, and S3 keeps charging you for the parts already stored until the upload is completed or aborted. Yikes! 😱 It's like a stage curtain that should be closed after the show - it's gotta come down for the costs to stop! 🎭🚫

That's where the magic of aborting the multipart stream comes into play! 🌟✨


So, let's dive into the superhero function, the abortMultiPartHelper! This function performs the crucial task of aborting a multipart upload in an S3 bucket. It's like the emergency exit button for our multipart process! 🚀🛑

/**
 * Aborts a multipart upload in an S3 bucket.
 *
 * @param {string} bucket - The S3 bucket name.
 * @param {string} key - The S3 object key.
 * @param {string} uploadId - The upload ID of the multipart upload.
 * @returns {Promise<Object>} A promise that resolves with the response from the S3 service.
 * @throws {Error} If any validation fails or an error occurs during the operation.
 */
const abortMultiPartHelper = async (bucket, key, uploadId) => {
  // Validation checks for bucket, key, and uploadId
  // It's like checking the keys before opening the treasure chest! 🔑💰

  try {
    const params = {
      Bucket: bucket,
      Key: key,
      UploadId: uploadId
    }
    const data = await S3.abortMultipartUpload(params).promise()
    return data
  } catch (error) {
    throw new Error(`Error during abortMultiPartHelper: ${error.message}`)
  }
}

Remember, this function helps in preventing those unwanted charges by stopping the multipart upload in its tracks! 🛑💼 It's the safety net we need backstage to ensure everything runs smoothly. 🌟🔧

Best Practices: Aborting Multipart Streams Safely

Let's talk about some genius moves to avoid those unexpected wallet withdrawals! 💰💸

Imagine this scenario: an open stream silently siphoning money from your pocket - not cool, right? As savvy backend devs, it's always recommended to create two Lambda functions that act like financial guards when working with multipart! 🦸‍♂️🔒
1️⃣ The Specific Stream Terminator: This Lambda function is your go-to buddy! It's like having a specific key to shut down any particular multipart stream gone rogue! 🗝️🛑

/**
 * Aborts a specific multipart stream based on the provided uploadId.
 *
 * @param {string} bucket - The S3 bucket name.
 * @param {string} key - The S3 object key.
 * @param {string} uploadId - The unique ID of the multipart stream to be aborted.
 * @returns {Promise<Object>} Resolves with the response from S3.
 * @throws {Error} If any validation fails or an error occurs during the process.
 */
const abortSpecificStream = async (bucket, key, uploadId) => {
  try {
    const params = {
      Bucket: bucket,
      Key: key,
      UploadId: uploadId
    }
    const data = await S3.abortMultipartUpload(params).promise()
    return data 
  } catch (error) {
    throw new Error(`Error during abortSpecificStream: ${error.message}`)
  }
}

2️⃣ The Stream Terminator Deluxe: This Lambda function is your ultimate guardian! It's designed to sweep through and close any open multipart streams from the past. 🌪️🔒


/**
 * Lists in-progress multipart uploads on a specific bucket.
 *
 * @param {string} bucket - The S3 bucket name.
 * @returns {Promise<Object>} A promise that resolves with the response containing in-progress multipart uploads.
 * @throws {Error} If any validation fails or an error occurs during the operation.
 */
const listMultiPartUploads = async (bucket) => {
  try {
    const params = {
      Bucket: bucket
    };
    const data = await S3.listMultipartUploads(params).promise();
    return data;
  } catch (error) {
    throw new Error(`Error during listMultiPartUploads: ${error.message}`);
  }
};
/**
 * Aborts all open multipart uploads for a given S3 bucket.
 *
 * @param {string} bucket - The S3 bucket name.
 * @returns {Promise<Object[]>} A promise resolving to an array containing information about aborted uploads.
 * @throws {Error} Throws an error if the operation encounters any issues.
 */
module.exports.abortMultiPart = async (bucket) => {
  try {
    // Fetches information about open multipart uploads
    const data = await listMultiPartUploads(bucket);
    const output = [];

    // Iterates through each in-progress upload (data.Uploads) and aborts it
    for (const obj of data.Uploads || []) {
      const key = obj.Key;
      const uploadId = obj.UploadId;
      // Aborts the multipart upload
      const response = await abortMultiPartHelper(bucket, key, uploadId);
      // Records the abort response for the upload
      output.push(response);
    }

    return output; // Returns an array containing information about aborted uploads
  } catch (e) {
    throw new Error(e.message); // Throws an error if any issues occur during the operation
  }
}

That's the secret sauce! With these Lambda heroes on our side, we're safeguarding against any unwanted ongoing expenses. 🌟💼 Now, that's smart backend development! 🔧👨‍💻


We've reached the finish line, folks! 🏁🎉 Across these two blogs, we've dived deep into everything about multipart upload - theory, practice, highs, lows, you name it! Hope you had a blast and picked up some cool new tricks along the way! 🚀📚
Dear data trailblazers! 🚀📊 Thanks a million for joining this thrilling data adventure! 🛡️🎸 As we navigate the digital realm, remember to stay safe, keep chasing those data dreams, and always seek out new knowledge! 🌟✨

See you down the data highway, fellow voyagers! Until next time - farewell! 👋🌟
