Introducing s3verify: verify that a local file is identical to an S3 object without having to download the object data

#aws #s3 #opensource

In February 2022, Amazon S3 released a new checksum feature that allows integrity checking without downloading the object data (blog post, documentation). Today, I'm happy to announce s3verify, a new program I've developed that builds on this feature. Before I talk about my program, I want to explain what the new S3 feature is and why it is so useful.

Previously, the only built-in way to attempt any kind of standardized verification like this was to use the object ETag. However, the ETag is only usable for this purpose if the object is unencrypted, and storing objects unencrypted is not acceptable for most users these days. For encrypted objects, the ETag is most likely a checksum of the ciphertext, which is probably all that the S3 error-checking process needs to verify that the data hasn't been corrupted. This has been a sorry state of affairs for a long time, forcing S3 users to come up with their own verification schemes. I am guessing that many big Amazon customers have been asking for this to be addressed, and earlier this year change finally arrived!

When you upload a file to S3, you can now specify a checksum algorithm (SHA-1, SHA-256, CRC-32, or CRC-32C). Your client computes the checksum while the upload is in progress and submits it at the end using an HTTP trailer. While S3 receives the data, it performs the same checksum computation on its side, and at the conclusion of the upload it rejects the upload if the two checksums do not match. The checksum is then stored immutably in the object metadata for the lifetime of the object, making it available afterwards without the need to download the object data. It is impossible to modify or accidentally remove the checksum from the object. Having this checksum easily accessible is especially useful for objects on Glacier, which can be very costly and take days to retrieve.
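To make the checksum format concrete, here is a minimal Python sketch of the client-side half for SHA-256. It assumes the value S3 reports is the base64 encoding of the raw digest, which matches the 44-character strings in the s3verify output further down. The function name `s3_style_sha256` is made up for this illustration and is not part of any AWS SDK or of s3verify.

```python
import base64
import hashlib
from typing import BinaryIO


def s3_style_sha256(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Hash a binary stream in chunks and return the base64-encoded SHA-256.

    Hypothetical helper: it mimics how a client could compute the checksum
    incrementally while the data is being sent.
    """
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    # Base64 of the raw 32-byte digest, not of the hex string.
    return base64.b64encode(digest.digest()).decode("ascii")
```

For a local file you could call it as `s3_style_sha256(open(path, "rb"))`. Note that this covers only objects uploaded in a single part; multipart objects use a checksum of checksums.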

If you have existing objects that were not uploaded with a checksum algorithm, you need to either make a copy of the object (using CopyObject with the x-amz-checksum-algorithm header) or upload the object from scratch with a checksum algorithm selected. This procedure might be a good subject for a future blog post.

Once you have an S3 object with a checksum, you may ask yourself: now how do I verify it? 🤔

Unfortunately, Amazon hasn't released any tool of their own to perform this verification, even though it has been six months since the feature was introduced. I expected an AWS CLI subcommand to appear eventually, but it hasn't happened. They did release some Java reference code that uses the AWS SDK on this documentation page, but that is very hard for most people to use.

I decided to fill this gap by building s3verify. It allows you to very easily verify that a local file is identical to an S3 object, without the need to download the object data. It only works on objects that were uploaded using this new checksum feature.

The program is very simple to use. Invoke it with a local file and an S3 object, and it will tell you if they are identical:

$ s3verify important-backup-2021.zip s3://mybucketname/important-backup-2021.zip
Fetching S3 object information...
S3 object checksum: x5AZd/g+YQp7l0kwcN8Hw+qqXZj2ekjAHG0yztmkWXg=
Object consists of 21 parts.

Part  1: fiP2aEgcQGHHJgmce4C3e/a3m50y/UJHsYFojMS3Oy8=  OK
Part  2: /lRdaagPhvRL9PxpQZOTKLxr1+xX64bYN6hknuy9y3k=  OK
Part  3: nS/vLGZ13Cq7cGWlwW3QnLkJaDTRrY8PUgsCGs9abKU=  OK
Part  4: HJWCIDAo8MY0nk5m4uBvUJ5R0aZzPAWJPE9F9WheEAk=  OK
Part  5: JExPU8KHhBJ1K+fh/p0gNT50ueRi6BxOL3XXSvHVUgQ=  OK
Part  6: gyp/OaxJqKz1mYWAZadtNhBgqEXpDUvMVuIZybwD1ic=  OK
Part  7: 1RcmmE8STey0mE33MXrzFAXbWrjawaVbnXeX5GB/F/Y=  OK
Part  8: XdcyPdbc2OYgF0NE/c9Q5vBgI8BXlv8tLZB3g6ETvlI=  OK
Part  9: pOKv/u4hlfGEpaBE5YTKA3IlVQDY+hMlySbdh9dfqsI=  OK
Part 10: W4WKSjF+blMinRdP9EcJ9mSDQMMyAUn0KfFgCWv8ZxI=  OK
Part 11: nP35yqHA+Pgum8yWeeXRZU/jPGF/ntnAR+rqOcwlhqk=  OK
Part 12: aoEWVZnc/8ualswzKmMXWZaQg/Bg/4zFs1MGQQTpHV0=  OK
Part 13: LVMnzhFxBPzFfVRFzilrfNCPX8zJhu1jNSNn7cZYmew=  OK
Part 14: OrcQx1cNqtatD6WGf4kA2R/ld7rVzQTkzbL9rAtYLDY=  OK
Part 15: 1+1AxALVTubSsdBW1qXs2toyCLDpq81I+ivFKPAzogs=  OK
Part 16: 3kPLbv0PCSlATrTOdzin03KbAezfi165l1Tq09gAN0Q=  OK
Part 17: IPTEvMXa/ZZe8IabeFDNWAF8hBV7dwNsu3wXJrBHwRE=  OK
Part 18: IOhxLxcmmqWvRi+y6ITVaPyFLzjo4wAB4f7e7I6CFYc=  OK
Part 19: tGCw1J2c2dYlZdxlxvLX+w4r6Cp9S5WhN7hJeRXJMUo=  OK
Part 20: sMH7Jh9qH/nUOue0/oBaaPYJXf8S81j6p7LoMub+7H8=  OK
Part 21: q5W9UMl7As4VVuEJcdvQC1ENyAVM2AlLc9utiEF4v4E=  OK

Checksum of checksums: x5AZd/g+YQp7l0kwcN8Hw+qqXZj2ekjAHG0yztmkWXg=

Checksum matches! File and S3 object are identical.
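The "checksum of checksums" line in the output above can be sketched in a few lines of Python. This is a simplified illustration, not s3verify's actual code. It assumes that, for a multipart SHA-256 object, the object-level checksum is the SHA-256 of the concatenated raw part digests, and that every part except the last has the same size. In practice the real part sizes would have to come from S3. The name `composite_sha256` and the `part_size` parameter are hypothetical.

```python
import base64
import hashlib
from typing import BinaryIO, List, Tuple


def composite_sha256(stream: BinaryIO, part_size: int) -> Tuple[List[str], str]:
    """Return (per-part checksums, checksum of checksums), all base64-encoded.

    Assumption: fixed part_size for all parts except the last one.
    Each part is read fully into memory, which is fine for a sketch but
    would need chunked reads for large parts.
    """
    part_digests: List[bytes] = []
    while True:
        part = stream.read(part_size)
        if not part:
            break
        part_digests.append(hashlib.sha256(part).digest())

    # Hash the concatenation of the raw (binary) part digests.
    combined = hashlib.sha256(b"".join(part_digests)).digest()

    def encode(raw: bytes) -> str:
        return base64.b64encode(raw).decode("ascii")

    return [encode(d) for d in part_digests], encode(combined)
```

Comparing each part checksum separately, as the output above does, also tells you which part of the file differs when the final value does not match.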

If the checksums do not match then you will see the following:

Checksum MISMATCH! File and S3 object are NOT identical!

If the file size and S3 object size do not match then you will see a similar error (in this case hashing will not be attempted).
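The order of these checks can be sketched as follows: compare sizes first, and only hash the file when the sizes agree. This is a rough outline under assumptions, not the program's real implementation. The function `compare_file_to_object`, its parameters, and the returned strings are all hypothetical, and the object size and checksum are taken as plain arguments so the sketch stays independent of any S3 client.

```python
import os
from typing import Callable


def compare_file_to_object(
    local_path: str,
    object_size: int,
    object_checksum: str,
    compute_checksum: Callable[[str], str],
) -> str:
    """Compare a local file against size and checksum values reported by S3.

    compute_checksum is any function that maps a file path to a checksum
    string in the same format as object_checksum (hypothetical hook).
    """
    # Cheap check first: a size difference already proves the data differs,
    # so the potentially slow hashing step is skipped entirely.
    if os.path.getsize(local_path) != object_size:
        return "size mismatch"
    if compute_checksum(local_path) != object_checksum:
        return "checksum mismatch"
    return "identical"
```

Doing the size comparison first avoids reading a large file from disk when the answer is already known.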

I hope that s3verify will be useful to you. Please file an issue in the GitHub repository if you have any problems using it. It is a perfect companion to my earlier S3 programs, shrimp and s3sha256sum.

P.S. Unfortunately, aws s3 cp doesn't yet have a --checksum-algorithm argument. It is very strange that they haven't added this yet. However, you can use shrimp in the meantime as it fully supports uploading objects with this new checksum feature.

P.P.S. There is another legacy project called s3verify that is currently ranked higher on most search engines. It is unrelated to checking object integrity. Hopefully my project will overtake it in search rankings soon.
