If you find yourself collaborating with a Data Product team, there's a high probability that you'll encounter the need to access data residing in an S3 bucket located in a different account. While this task may not be overly complicated, it isn't always as straightforward as one might hope, leading many colleagues to seek guidance on the matter. In the following article, I'll provide you with a comprehensive guide on how to effortlessly establish cross-account access between Athena and S3 in just two simple steps. So, without further ado, let's delve into the details!
Step 1: Bucket Policy in Source Account
The first thing you need to do is create a bucket policy in the source account (where the data actually lives in S3) to allow the "client" account to read the data. Go to the bucket in S3, click on the "Permissions" tab and edit the bucket policy as follows:
{
    "Version": "2012-10-17",
    "Id": "cross-account-bucket-policy-ney",
    "Statement": [
        {
            "Sid": "CrossAccountPermission",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<CLIENT-ACCOUNT-NUMBER>:root"
            },
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload"
            ],
            "Resource": [
                "arn:aws:s3:::<BUCKET-NAME>",
                "arn:aws:s3:::<BUCKET-NAME>/silver/titanic/*"
            ]
        }
    ]
}
This bucket policy allows any IAM identity in the client account (provided it also has the corresponding S3 permissions on its own side) to access this bucket and read objects under the /silver/titanic/ prefix. If you want to restrict access to a specific user, replace root with user/<USERNAME>, for instance user/neylson.crepalde, as shown in the snippet below.
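For illustration, here is what the Principal element would look like when scoped to a single IAM user (the account number is a placeholder, and the username follows the example above):

"Principal": {
    "AWS": "arn:aws:iam::<CLIENT-ACCOUNT-NUMBER>:user/neylson.crepalde"
}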
Step 2: Create an External Table in Athena in the Client Account
In the client account, navigate to the Athena console and open the query editor. Run a SQL statement to create an external table pointing to the S3 bucket in the source account. In our case, we are testing with the (very famous) Titanic dataset, stored as a Delta table and partitioned by one of its columns, pclass:
CREATE EXTERNAL TABLE `titanicdelta`(
`passengerid` int,
`survived` int,
`name` string,
`sex` string,
`age` double)
PARTITIONED BY (
`pclass` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<BUCKET-NAME>/silver/titanic/_symlink_format_manifest'
Because this data is partitioned, after creating the external table you also have to load the partition metadata with the following SQL command:
MSCK REPAIR TABLE `titanicdelta`
And you're done! Now you can query your data directly from the client account.
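As a quick sanity check, you can run a simple query against the new table (the LIMIT value here is arbitrary):

SELECT * FROM titanicdelta LIMIT 10;

If the bucket policy and the table definition line up, this returns Titanic rows read directly from the S3 bucket in the source account.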
By following these two straightforward steps, you can establish a secure and efficient cross-account connection between Athena and S3. This enables you and your Data Product team to access and analyze data stored in a remote S3 bucket effortlessly. So, the next time you encounter the need to access data from another account, you can confidently navigate the process and achieve your goals without any hassle.
Remember, fostering collaboration and enabling seamless data access across different accounts is crucial for efficient and streamlined workflows within a Data Product team. By mastering this skill, you can enhance your productivity and contribute to the success of your projects.