Last updated: 2020-04-05
I’ll be updating my AWS articles from time to time, as I learn more. I got my first cert — the AWS Certified Cloud Practitioner certification — on January 22nd, but as I took the practice exams (5 exams, 2x each) and the actual exam, I learned about gaps in my knowledge. So I’ll be filling those in through the articles I wrote beforehand.
S3 is object-based storage with a flat structure and no overall capacity limit. Unlimited doesn’t mean we should throw files up there without thinking, though: storage still costs money. It’s not block-based, so it’s not meant for hosting operating systems or live databases. But any type of file can be stored (including database file backups), and each can be from 0 bytes up to 5 TB. General knowledge about S3 is one of the key categories in the AWS Certified Cloud Practitioner exam.
Files are stored in buckets, which we can think of as root-level folders. Bucket names must be globally unique because they resolve to URLs, which are global. Most of the time, we wouldn’t expose these URLs publicly except for static S3 websites.
When we name our buckets, AWS automatically adds “S3” or the bucket's region to the bucket's URL, depending on the region. Here are the two naming examples:
us-east-1:
All other regions:
Based on the above, we may wonder why the names still need to be globally unique rather than unique per region. My answer: I don't know. Maybe it's a legacy reason. I recommend reverse-domain naming, using an appropriate domain you own. For example, I start all my bucket names with com.markfreedman. An exception would be when we host a static site. In that case, we need to use a normal domain name (in my case, markfreedman.com, although I already have this hosted elsewhere).
Based on this article and this documentation, please note that the bucket URL naming convention is changing. AWS supports both path-style requests and virtual hosted-style requests. But any buckets created after September 30, 2020 will only support virtual hosted-style requests. Also, the region should be specified in the URL. We could leave out the region, but that adds a bit of overhead, because AWS forces a 307 redirect to the specific region (us-east-1 is checked first).
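To make the difference between the two request styles concrete, here's a minimal sketch that builds both kinds of URL. The host formats follow the publicly documented S3 endpoint patterns as I understand them; the bucket name `com-example-demo`, the key, and the helper function names are made up for illustration.

```python
from typing import Optional


def virtual_hosted_url(bucket: str, key: str, region: Optional[str] = None) -> str:
    """Virtual hosted-style: the bucket name is part of the host name."""
    if region is None:
        # No region in the host, so S3 may answer with a redirect to the right region.
        return f"https://{bucket}.s3.amazonaws.com/{key}"
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


def path_style_url(bucket: str, key: str, region: str) -> str:
    """Path-style: the bucket name is the first segment of the path."""
    return f"https://s3.{region}.amazonaws.com/{bucket}/{key}"


# Hypothetical bucket and key, used only to show the shape of each URL.
print(virtual_hosted_url("com-example-demo", "notes/readme.txt", "us-west-2"))
print(path_style_url("com-example-demo", "notes/readme.txt", "us-west-2"))
print(virtual_hosted_url("com-example-demo", "notes/readme.txt"))
```

The only structural difference is where the bucket name goes: in the host name (virtual hosted-style) or at the start of the path (path-style).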
Here are the updated virtual hosted-style naming examples:
Without specifying a region (will 307 redirect):
I’m changing my recommended bucket naming convention slightly, due to the Bucket Names with Dots section of this article:
I recommend using reverse-domain naming using an appropriate domain you own, but replacing dots with dashes. For example, I now start all my bucket names with com-markfreedman. An exception would be when we host a static site. In that case, we need to use a normal domain name (in my case, markfreedman.com, although I already have this hosted elsewhere).
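The dashed reverse-domain convention is easy to automate. The sketch below is one way to do it, assuming the general S3 naming rules (3 to 63 characters; lowercase letters, digits, dots and hyphens; starting and ending with a letter or digit). The function name and the `backups` suffix are hypothetical.

```python
import re

# General S3 bucket naming rules, as I understand them: 3-63 characters,
# lowercase letters, digits, dots and hyphens, starting and ending with
# a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def reverse_domain_bucket_name(domain: str, suffix: str) -> str:
    """Build a name like 'com-markfreedman-<suffix>' from 'markfreedman.com'."""
    parts = domain.lower().split(".")
    name = "-".join(list(reversed(parts)) + [suffix.lower()])
    if not _BUCKET_NAME.fullmatch(name):
        raise ValueError(f"not a valid bucket name: {name!r}")
    return name


print(reverse_domain_bucket_name("markfreedman.com", "backups"))
```

This only checks the shape of the name. Whether the name is still available globally is something only AWS can tell us when we try to create the bucket.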
Although bucket names must be globally unique, storage of the buckets themselves is region-specific. We should select a bucket’s region based on latency requirements. If most access would be from a certain region, create the bucket in the closest available AWS region. Using CloudFront can alleviate this need, though.
Public access is blocked by default. AWS requires us to be explicit in exposing buckets to the public Internet. All those stories of leaked data (often from exposed S3 buckets) should make us thankful for this default. We can secure buckets with IAM policies and bucket policies.
We can also set lifecycle management for a bucket, which specifies which storage class to move the bucket's objects to, and when to move them. More on storage classes below.
When we upload a file to an S3 bucket, AWS considers the file name to be the key, and refers to it as the key in the S3 APIs and SDKs. The S3 data model is a flat structure. In other words, there’s no hierarchy of subfolders (sub-buckets?). This is why I described buckets as root-level folders. However, we can simulate a logical folder hierarchy by separating portions of the key name with forward slashes (/).
The file content is referred to as the value. Therefore, an S3 file is sometimes referred to as a key/value pair.
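Here's a small sketch of how a folder view can be derived from flat keys. It's plain Python over a list of made-up keys, not a call to S3, but it mirrors the prefix-and-delimiter idea that S3 listings use. The function name `list_level` is my own.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Return (files, folders) that sit directly under `prefix`."""
    files, folders = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something follows a delimiter, so `head` acts like a subfolder.
            folders.add(prefix + head + delimiter)
        else:
            files.append(key)
    return files, sorted(folders)


# Hypothetical keys in a single bucket. The bucket itself stores them flat.
keys = [
    "index.html",
    "images/logo.png",
    "images/icons/favicon.ico",
    "docs/guide.pdf",
]
print(list_level(keys))
print(list_level(keys, prefix="images/"))
```

The "folders" exist only because of how we slice the key names. Nothing in the bucket changes when we look at it this way.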
Files can be versioned and encrypted, and can carry other metadata. We can secure files (objects) with IAM and bucket policies, and set ACLs at the file (object) level. By default, the resource owner has full ACL rights to the file. For extra protection, we can require multi-factor authentication (MFA) in order to delete an object.
When we upload a file to an S3 bucket, we’ll know the upload was successful if an HTTP 200 code is returned. This is most important when uploading programmatically. If we upload manually, AWS tells us whether it succeeded.
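As a rough sketch of the programmatic check, the helper below uploads some text and looks at the status code in the response. It assumes a client with a boto3-style `put_object` method and response shape. `FakeS3Client` is a stand-in I wrote so the example runs without AWS credentials; `upload_text` and the bucket and key names are hypothetical too.

```python
def upload_text(client, bucket: str, key: str, text: str) -> bool:
    """Upload `text` and report whether S3 answered with HTTP 200."""
    response = client.put_object(Bucket=bucket, Key=key, Body=text.encode("utf-8"))
    return response["ResponseMetadata"]["HTTPStatusCode"] == 200


class FakeS3Client:
    """Stand-in for a real S3 client, for illustration only."""

    def __init__(self):
        self.objects = {}

    def put_object(self, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body
        return {"ResponseMetadata": {"HTTPStatusCode": 200}}


# With boto3 installed and credentials configured, this would instead be:
#   client = boto3.client("s3")
client = FakeS3Client()
print(upload_text(client, "com-example-demo", "notes/hello.txt", "hello"))
```

In practice a real SDK client usually raises an exception when an upload fails rather than returning a non-200 response, so error handling around the call matters as much as the status check.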
We can expect 99.99% availability, but AWS only guarantees 99.9%. S3 is also designed for 99.999999999% durability (11 x 9s), so we can be confident that our files will almost certainly still be there.
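These percentages are easier to judge as concrete numbers. The sketch below converts an availability figure into hours of possible downtime per year, and a durability figure into the expected number of objects lost per year. It assumes both percentages apply over a 365-day year. The function names and the 10 million object count are just for illustration.

```python
from decimal import Decimal

HOURS_PER_YEAR = Decimal(365 * 24)


def downtime_hours_per_year(availability_percent: str) -> Decimal:
    """Hours per year a service could be unavailable at the given availability."""
    unavailable = (Decimal(100) - Decimal(availability_percent)) / Decimal(100)
    return unavailable * HOURS_PER_YEAR


def expected_objects_lost_per_year(object_count: int, durability_percent: str) -> Decimal:
    """Average number of objects lost per year at the given annual durability."""
    loss_rate = (Decimal(100) - Decimal(durability_percent)) / Decimal(100)
    return loss_rate * Decimal(object_count)


print(downtime_hours_per_year("99.9"))    # the guaranteed figure
print(downtime_hours_per_year("99.99"))   # the figure we can expect
print(expected_objects_lost_per_year(10_000_000, "99.999999999"))
```

Decimal is used instead of float so that subtracting numbers this close to 100 doesn't pick up rounding noise.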
There are specific “data consistency” rules:
Read after Write Consistency: when new files are uploaded, we can read the file immediately afterwards.
Eventual Consistency: when files are updated or deleted, immediately attempting to read the file afterwards may return the old file content. It can take a short period of time (perhaps a few seconds or more) for the change to propagate throughout AWS (replication, cache clearing), which is why we may see the old file.
(Update, 2020-04-05: I originally mentioned that new files use POST and updates use PUT. But according to what I can gather from their docs and some online Q&A, POST is actually an alternative to PUT that enables browser-based uploads to S3. Parameters can either be passed via HTTP headers by using PUT, or passed via form fields by using POST, no matter if the object is new or being replaced. From what I can tell, S3 doesn't really "replace" objects per se, since versioning is an option.)
S3 supports tiered storage classes, which we can change on demand at the object level. We don’t specify a class at bucket creation time. Lifecycle rules, by contrast, are specified at the bucket level and apply to the objects in that bucket. The storage classes are:
S3 Standard (most common) is designed to sustain loss of 2 facilities concurrently, and has the best performance.
S3 IA (Standard-IA, Infrequent Access) is lower cost for storage, but we’re charged a retrieval fee.
S3 One Zone IA is a lower cost version of S3 IA for data that doesn’t require multiple zone resilience. It’s the only tier stored in just one availability zone; all the others are replicated across 3 or more zones.
S3 Intelligent Tiering allows AWS to automatically move data to the most cost-effective tier by monitoring usage patterns. For most buckets, I recommend using this, although it’s best for long-lived data with unpredictable access patterns.
S3 Glacier is a secure, durable, low cost archival tier, which allows for configurable retrieval times, from minutes to hours. It provides query-in-place functionality for data analysis of archived data.
S3 Glacier Deep Archive is the lowest cost tier, but retrieval can take up to 12 hours. This is great for archived data that doesn’t need to be readily available.
S3 RRS is Reduced Redundancy Storage, but is being phased out. It appears to be similar to S3 One Zone IA.
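To tie lifecycle rules to these storage classes, here's a sketch of what a rule might look like. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument, as far as I know. The rule ID, the `logs/` prefix, and the day counts are made up, and `storage_class_at` is a hypothetical helper that only approximates when S3 would actually move an object.

```python
# Hypothetical rule: objects under "logs/" move to cheaper tiers as they age.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-old-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }
    ]
}


def storage_class_at(rule, age_days, initial="STANDARD"):
    """Storage class an object of this age would be in under the rule (approximate)."""
    current = initial
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


rule = lifecycle_configuration["Rules"][0]
for age in (0, 45, 120, 400):
    print(age, storage_class_at(rule, age))
```

The rule lives on the bucket, but it acts on each matching object individually as that object ages.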
What we’re charged for using S3 (covered in another article) is based on:
- Storage Management
- Data Transfer
- Transfer Acceleration, which enables fast transfer to distant locations by using CloudFront edge locations, making use of backbone networks (much larger network “pipes”).
- Cross Region Replication, which automatically replicates to another region bucket for disaster recovery purposes.
- We can also configure buckets to require the requester pay for access.
- If we have multiple accounts under an Organization, S3 offers us volume discounts when we enable Consolidated Billing.