Today I am releasing two open source programs that will help you manage your data on Amazon S3: shrimp and s3sha256sum.
The first program, shrimp, is an interactive multipart uploader built specifically for uploading files over a slow internet connection. It doesn't matter if the upload takes days or weeks, or if you have to stop it and restart it at another time: shrimp will always be able to resume where it left off. Unlike the AWS CLI, shrimp never aborts the multipart upload (please set up a lifecycle policy to clean up abandoned multipart uploads, as described here).
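If you're wondering how resuming is possible at all: S3 itself keeps the parts of an in-progress multipart upload around until the upload is completed or aborted, so a client can simply ask S3 what has already been uploaded. Here is a minimal Go sketch using the AWS SDK for Go v2 (the bucket name is a placeholder, and this is an illustration of the idea, not shrimp's actual code) that lists in-progress uploads and how many parts each one already has:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	// Placeholder bucket name.
	bucket := "my-bucket"

	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// List in-progress multipart uploads. S3 keeps them (and their parts)
	// until they are completed or explicitly aborted.
	uploads, err := client.ListMultipartUploads(context.TODO(), &s3.ListMultipartUploadsInput{
		Bucket: aws.String(bucket),
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, u := range uploads.Uploads {
		// The already-uploaded parts tell a client where it can safely resume.
		parts, err := client.ListParts(context.TODO(), &s3.ListPartsInput{
			Bucket:   aws.String(bucket),
			Key:      u.Key,
			UploadId: u.UploadId,
		})
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s (upload ID %s): %d part(s) already uploaded\n",
			aws.ToString(u.Key), aws.ToString(u.UploadId), len(parts.Parts))
	}
}
```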
While shrimp is uploading, you can use your keyboard to adjust the bandwidth throttle. Press ? to bring up a list of available controls. If you want shrimp to throttle your upload during the day, simply set your desired limit (a, s, d, and f increase the limit in different increments; z, x, c, and v decrease it). Then in the evening, you can remove the throttle with u. This lets you keep the upload going without slowing down your internet connection too much. And if you want to move to another location with a faster upload speed (e.g. at your work), simply pause the upload with p, move to the new location, and unpause by pressing p a second time.
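If you're curious how a throttle like this can work in principle, the general shape is a rate-limited reader wrapped around the data being uploaded, with a limit that can be changed at runtime. Here is a rough Go sketch using golang.org/x/time/rate; the package choice and the numbers are illustrative, not necessarily how shrimp does it:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"strings"

	"golang.org/x/time/rate"
)

// throttledReader wraps an io.Reader and waits on a rate.Limiter
// before returning data, capping throughput at roughly the limiter's rate.
type throttledReader struct {
	r       io.Reader
	limiter *rate.Limiter
}

func (t *throttledReader) Read(p []byte) (int, error) {
	// Cap each read at the limiter's burst so WaitN never exceeds it.
	if b := t.limiter.Burst(); len(p) > b {
		p = p[:b]
	}
	n, err := t.r.Read(p)
	if n > 0 {
		// Block until the limiter allows n more bytes.
		if werr := t.limiter.WaitN(context.TODO(), n); werr != nil {
			return n, werr
		}
	}
	return n, err
}

func main() {
	// Roughly 1 KiB/s with a 1 KiB burst. Adjusting the limit at runtime
	// (limiter.SetLimit) is the kind of thing a keyboard control can hook into.
	limiter := rate.NewLimiter(rate.Limit(1024), 1024)
	src := strings.NewReader(strings.Repeat("x", 4096))

	n, err := io.Copy(io.Discard, &throttledReader{r: src, limiter: limiter})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("copied", n, "bytes")
}
```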
shrimp makes it a worry-free process to upload very large files to Amazon S3. I'm planning to use it to upload terabytes of data to Glacier for backup purposes. Please let me know what your experience is using it, and what improvements you can think of.
The next program was created to give you peace of mind that your files were in fact uploaded correctly. I wanted to verify that shrimp was doing the right thing, not uploading bytes incorrectly or reassembling the parts in the wrong order. Things can go wrong, so what can we do to verify that the correct behavior is taking place?
The program is called s3sha256sum, and the name should be familiar to many of you. As the name implies, it calculates SHA256 checksums of objects on Amazon S3. It uses a normal GetObject request and streams the object contents to the SHA256 hashing function. This way there is no need to download the entire object to your hard drive. You can verify very large objects without worrying about running out of local storage.
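The core of that idea is small enough to sketch. Roughly speaking (this is an illustration with the AWS SDK for Go v2, not s3sha256sum's actual source; the bucket and key are placeholders), it looks like this:

```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	// Placeholder bucket and key.
	bucket, key := "my-bucket", "path/to/object"

	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// Stream the object body straight into the hash; nothing touches disk.
	resp, err := client.GetObject(context.TODO(), &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	h := sha256.New()
	if _, err := io.Copy(h, resp.Body); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s  s3://%s/%s\n", hex.EncodeToString(h.Sum(nil)), bucket, key)
}
```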
To save on costs when you verify your objects, you should know that it may be cheaper to spin up an EC2 instance in the same region as the S3 bucket and run s3sha256sum from that instance. This is because data transfer from S3 to EC2 within the same region is free, as the S3 pricing page clarifies:
> You pay for all bandwidth into and out of Amazon S3, except for the following:
>
> - Data transferred from an Amazon S3 bucket to any AWS service(s) within the same AWS Region as the S3 bucket (including to a different account in the same AWS Region).
If you attach the expected checksum to the object (either as metadata or a tag), then s3sha256sum can automatically compare the checksum that it just computed with the checksum that you stored on the object. It will print `OK` or `FAILED` depending on the outcome. For an example, see this discussion.
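To give a feel for that comparison step, here is a rough sketch of what it could look like in Go, reusing the client and imports from the earlier GetObject example. The metadata key name ("sha256sum") is a placeholder for illustration; check the project's README for the exact metadata and tag keys the tool looks for:

```go
// compareStoredChecksum is a sketch of the comparison step: it reads a
// hypothetical "sha256sum" key from the object's user metadata and compares
// it with the digest that was just computed.
func compareStoredChecksum(ctx context.Context, client *s3.Client, bucket, key, computed string) {
	head, err := client.HeadObject(ctx, &s3.HeadObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		log.Fatal(err)
	}
	expected, ok := head.Metadata["sha256sum"]
	if !ok {
		fmt.Println("no stored checksum to compare against")
		return
	}
	if expected == computed {
		fmt.Println("OK")
	} else {
		fmt.Println("FAILED")
	}
}
```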
s3sha256sum has one more trick up its sleeve. Say you're running it on a 1 TB object but for some reason have to abort it (with Ctrl-C) before it finishes. When the program is interrupted, it captures the internal state of the hash function and prints a command that lets you resume hashing from that position. For an example, see this discussion.
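This works because SHA-256 implementations can expose their internal state. In Go, for example, the standard library's sha256 hash can be serialized and restored via encoding.BinaryMarshaler and encoding.BinaryUnmarshaler. Here is a self-contained sketch of the idea (not the tool's actual code):

```go
package main

import (
	"crypto/sha256"
	"encoding"
	"encoding/base64"
	"fmt"
	"log"
)

func main() {
	h := sha256.New()
	h.Write([]byte("first half of a very large object..."))

	// Snapshot the hash's internal state (Go's sha256 supports this
	// via encoding.BinaryMarshaler).
	state, err := h.(encoding.BinaryMarshaler).MarshalBinary()
	if err != nil {
		log.Fatal(err)
	}
	saved := base64.StdEncoding.EncodeToString(state)
	fmt.Println("saved state:", saved)

	// Later: restore the state into a fresh hash and keep going.
	h2 := sha256.New()
	raw, err := base64.StdEncoding.DecodeString(saved)
	if err != nil {
		log.Fatal(err)
	}
	if err := h2.(encoding.BinaryUnmarshaler).UnmarshalBinary(raw); err != nil {
		log.Fatal(err)
	}
	h2.Write([]byte("...second half of the object"))
	fmt.Printf("%x\n", h2.Sum(nil))
}
```

Combined with a ranged GetObject that starts at the byte offset where the previous run stopped, restoring the saved state is enough to pick up hashing exactly where you left off.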
I think that about wraps it up. Please give shrimp and s3sha256sum a try and let me know if you find any bugs or have ideas for improvements. Thank you for reading!