Lake Formation enables organisations to share data securely between business units and to scale the solution without adding much administrative overhead. The data can be stored in different AWS accounts belonging to different teams.
AWS services which enable data sharing
1) AWS Glue
2) AWS Lake Formation
AWS Glue
AWS Glue is a managed service that allows for the crawling of data repositories to aid in the creation of a data catalogue.
Glue also provides jobs, which are its Extract, Transform, and Load (ETL) tools. One difficulty is that sharing access in Glue relies on role-based access control with IAM roles and policies, which requires knowledge of the underlying storage mechanism.
You must also create policies for both the Glue Catalog and the S3 Bucket.
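To make the double bookkeeping concrete, here is a minimal sketch of an identity policy that grants read access both to the catalogue metadata and to the underlying S3 objects. The account ID, region, and bucket name are hypothetical placeholders, and the action list is an assumption about what a read-only consumer would need, not a definitive policy.

```shell
# Hypothetical values: replace with your own account, region, database and bucket.
ACCOUNT_ID="111122223333"
REGION="ap-southeast-2"
DATABASE="my-source-glue-database-demo"
BUCKET="my-source-bucket"

# Print an IAM policy document covering BOTH layers:
# the Glue Data Catalog metadata and the S3 objects behind it.
glue_read_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadCatalogMetadata",
      "Effect": "Allow",
      "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetTables", "glue:GetPartitions"],
      "Resource": [
        "arn:aws:glue:${REGION}:${ACCOUNT_ID}:catalog",
        "arn:aws:glue:${REGION}:${ACCOUNT_ID}:database/${DATABASE}",
        "arn:aws:glue:${REGION}:${ACCOUNT_ID}:table/${DATABASE}/*"
      ]
    },
    {
      "Sid": "ReadUnderlyingData",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::${BUCKET}", "arn:aws:s3:::${BUCKET}/*"]
    }
  ]
}
EOF
}

# To create the policy (the policy name is a placeholder):
# aws iam create-policy --policy-name glue-read-demo --policy-document "$(glue_read_policy)"
```

Note that whoever writes this policy has to know which bucket backs the table, which is exactly the coupling to the storage layer described above.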
AWS Lake Formation
AWS Lake Formation simplifies access management and resource sharing across accounts. It offers a straightforward grant mechanism that anyone familiar with SQL GRANT statements will recognise.
These grants can be made to IAM identities, AWS accounts, or an entire AWS Organization or organisational unit (OU).
After a cross-account grant is created, Lake Formation integrates with AWS Resource Access Manager (RAM) to create a cross-account resource share.
The shared catalogue resources will then be visible in the local data catalogue of Lake Formation administrators in the target account.
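As a rough illustration of how SQL-like such a grant is, the sketch below wraps the AWS CLI `lakeformation grant-permissions` command in a small function. The consumer account ID and table name are hypothetical, and the helper functions are illustrative names rather than part of any AWS tooling.

```shell
# Hypothetical values: replace with your own consumer account, database and table.
CONSUMER_ACCOUNT_ID="444455556666"
DATABASE="my-source-glue-database-demo"
TABLE="my_table"

# Build the JSON resource description for a single catalogue table.
table_resource() {
  printf '{"Table":{"DatabaseName":"%s","Name":"%s"}}' "$1" "$2"
}

# Grant SELECT (with grant option) on a table to a principal.
# The principal can be an IAM ARN or, for cross-account sharing, an account ID.
grant_select() {
  aws lakeformation grant-permissions \
    --principal "DataLakePrincipalIdentifier=$1" \
    --resource "$(table_resource "$2" "$3")" \
    --permissions SELECT \
    --permissions-with-grant-option SELECT
}

# Usage (requires Lake Formation administrator credentials):
# grant_select "$CONSUMER_ACCOUNT_ID" "$DATABASE" "$TABLE"
```

The grant names only catalogue objects; no bucket or object path appears anywhere in the call.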
Solution Overview
Share data across AWS accounts to enable a multi-source data analytics solution.
Solution Components
Centralized Datalake account
1) Store the data in S3
2) Catalog that data, so that the data is visible and the schema is known
3) Share that data with other AWS accounts
Consuming Data account
1) Query the data in the source account
The diagram below shows how those components can work together to provide this solution:
Setting up Lake Formation
1) A Lake Formation Administrator should be assigned.
The administrator will then be able to manage access to data catalogue resources both within and across accounts.
Lake Formation administrators can be either IAM users or IAM roles.
2) Change the Lake Formation permission model from IAM to Lake Formation native grants
3) Establish the centralized data lake
4) Upload the file using the AWS CLI or from the S3 console:
aws s3 sync . s3://my-source-bucket
5) Add the crawler and give it permission to read the bucket and write to the catalog
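For step 2, one possible way to switch to Lake Formation native grants from the command line is to clear the default permissions in the data lake settings, as sketched below. The administrator ARN is a hypothetical placeholder and the helper function names are illustrative. Because `put-data-lake-settings` replaces the existing settings, the administrator list is included in the same call.

```shell
# Hypothetical administrator role: replace with your own IAM user or role ARN.
ADMIN_ARN="arn:aws:iam::111122223333:role/DataLakeAdmin"

# Build the settings document: one administrator, and empty default permissions
# so that new databases and tables no longer receive the IAM-only default grant.
data_lake_settings() {
  printf '{"DataLakeAdmins":[{"DataLakePrincipalIdentifier":"%s"}],"CreateDatabaseDefaultPermissions":[],"CreateTableDefaultPermissions":[]}' "$1"
}

# Apply the settings to the current account and region.
apply_data_lake_settings() {
  aws lakeformation put-data-lake-settings \
    --data-lake-settings "$(data_lake_settings "$1")"
}

# Usage:
# apply_data_lake_settings "$ADMIN_ARN"
```

Databases and tables that already exist may still carry the IAMAllowedPrincipals grant, so they are worth checking separately after this change.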
Use this CloudFormation stack to deploy the resources described in the steps above:
AWSTemplateFormatVersion: '2010-09-09'
Description: My data lake source
Resources:
  LakeformationSettings:
    Type: AWS::LakeFormation::DataLakeSettings
    Properties:
      Admins:
        - DataLakePrincipalIdentifier: arn:aws:iam::XXXXXXXXX:role/aws-reserved/sso.amazonaws.com/ap-southeast-2/AWSReservedSSO_AWSAdministratorAccess_85c5426c350156b8
  MySourceDataStore:
    Type: AWS::S3::Bucket
    DeletionPolicy: Delete
    Properties:
      AccessControl: Private
      BucketName: !Sub 'my-source-data-store-${AWS::Region}-${AWS::AccountId}'
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
  MySourceGlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: my-source-glue-database-demo
        Description: String
  MySourceCrawlerRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: 'Allow'
            Principal:
              Service:
                - 'glue.amazonaws.com'
            Action:
              - 'sts:AssumeRole'
      Path: '/'
      Policies:
        - PolicyName: 'root'
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 'glue:*'
                Resource: '*'
              - Effect: Allow
                Action:
                  - 'logs:CreateLogGroup'
                  - 'logs:CreateLogStream'
                  - 'logs:PutLogEvents'
                  - 'logs:AssociateKmsKey'
                Resource: '*'
              - Effect: Allow
                Action: 's3:ListBucket'
                Resource: !GetAtt MySourceDataStore.Arn
              - Effect: Allow
                Action: 's3:GetObject'
                Resource: !Sub
                  - '${Bucket}/*'
                  - { Bucket: !GetAtt MySourceDataStore.Arn }
  MySourceCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: my-source-data-crawler
      Role: !GetAtt MySourceCrawlerRole.Arn
      DatabaseName: !Ref MySourceGlueDatabase
      Targets:
        S3Targets:
          - Path: !Ref MySourceDataStore
      SchemaChangePolicy:
        UpdateBehavior: 'UPDATE_IN_DATABASE'
        DeleteBehavior: 'LOG'
  SourceCrawlerLakeGrants:
    Type: AWS::LakeFormation::Permissions
    Properties:
      DataLakePrincipal:
        DataLakePrincipalIdentifier: !GetAtt MySourceCrawlerRole.Arn
      Permissions:
        - ALTER
        - DROP
        - CREATE_TABLE
      Resource:
        DatabaseResource:
          Name: !Ref MySourceGlueDatabase
  DatalakeLocation:
    Type: AWS::LakeFormation::Resource
    Properties:
      ResourceArn: !GetAtt MySourceDataStore.Arn
      RoleArn: !Sub arn:aws:iam::${AWS::AccountId}:role/aws-service-role/lakeformation.amazonaws.com/AWSServiceRoleForLakeFormationDataAccess
      UseServiceLinkedRole: true
6) When you have deployed the crawler, you should be able to see and run it in the Glue Console
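If you prefer the command line to the console, the crawler can also be started and checked with the AWS CLI. The sketch below is a hypothetical helper (the function name and polling interval are assumptions); it assumes the crawler reports a state other than READY while it is running, so an unusually fast run could slip past the loop.

```shell
# Names taken from the template above.
CRAWLER_NAME="my-source-data-crawler"
DATABASE="my-source-glue-database-demo"
POLL_SECONDS="${POLL_SECONDS:-15}"

# Start the crawler, wait until it is READY again, then list the tables it found.
run_crawler() {
  aws glue start-crawler --name "$1"
  # Poll the crawler state; it is RUNNING or STOPPING while work is in progress.
  while [ "$(aws glue get-crawler --name "$1" --query 'Crawler.State' --output text)" != "READY" ]; do
    sleep "$POLL_SECONDS"
  done
  aws glue get-tables --database-name "$2" --query 'TableList[].Name' --output text
}

# Usage:
# run_crawler "$CRAWLER_NAME" "$DATABASE"
```

The table name printed at the end is the one to reference in the cross-account grant.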
7) Cross-account grant
To enable cross-account access, you will need to add a Lake Formation grant and specify the consumer account number.
  CrossAccountLakeGrants:
    Type: AWS::LakeFormation::Permissions
    Properties:
      DataLakePrincipal:
        DataLakePrincipalIdentifier: "XXXXXXXXXX" # Consumer account number
      Permissions:
        - SELECT
      PermissionsWithGrantOption:
        - SELECT
      Resource:
        TableResource:
          DatabaseName: !Ref MySourceGlueDatabase
          Name: !Sub 'my_source_data_store_ap_southeast_2_${AWS::AccountId}'
8) Set up permissions in the consumer account
Log in to the consumer account and set up the Lake Formation base settings:
1) Set up a Lake Formation administrator
AWSTemplateFormatVersion: '2010-09-09'
Description: My consumer data lake setup
Resources:
  LakeformationSettings:
    Type: AWS::LakeFormation::DataLakeSettings
    Properties:
      Admins:
        - DataLakePrincipalIdentifier: arn:aws:iam::XXXXXXXXXX:role/aws-reserved/sso.amazonaws.com/AWSReservedSSO_AWSAdministratorAccess_56cabj890003333
2) Turn on Lake Formation grants:
Create a resource link to the shared database in the data lake account. Unfortunately, this is not available via CloudFormation yet. In the Lake Formation console, click Databases, then the Create database button, and choose to create a resource link.
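As an alternative to the console, a resource link can also be created from the AWS CLI by creating a Glue database whose `TargetDatabase` points at the shared database. This is a minimal sketch; the producer account ID and link name are hypothetical, and the helper function names are illustrative.

```shell
# Hypothetical values: replace with the data lake (producer) account and your preferred link name.
PRODUCER_ACCOUNT_ID="111122223333"
SHARED_DATABASE="my-source-glue-database-demo"
LINK_NAME="my_source_link"

# Build the database input for a resource link:
# a local database entry that points at a database in another account's catalogue.
resource_link_input() {
  printf '{"Name":"%s","TargetDatabase":{"CatalogId":"%s","DatabaseName":"%s"}}' "$1" "$2" "$3"
}

# Create the resource link in the consumer account's catalogue.
create_resource_link() {
  aws glue create-database \
    --database-input "$(resource_link_input "$1" "$2" "$3")"
}

# Usage (run with consumer-account credentials):
# create_resource_link "$LINK_NAME" "$PRODUCER_ACCOUNT_ID" "$SHARED_DATABASE"
```

A link name with underscores rather than hyphens tends to be easier to reference in Athena queries.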
3) Open the Athena console. You should be able to see your linked database and the table schema. All that remains is to query the table and make sure it returns results.
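The same check can be scripted. The sketch below starts a small preview query through the AWS CLI; the link name, table name, and results bucket are hypothetical placeholders, and the helper functions are illustrative only.

```shell
# Hypothetical values: replace with your resource link, table and an S3 location for Athena results.
LINK_NAME="my_source_link"
TABLE="my_table"
RESULTS_LOCATION="s3://my-athena-results-bucket/lake-formation-demo/"

# Build a small preview query against the shared table.
preview_query() {
  printf 'SELECT * FROM "%s"."%s" LIMIT 10' "$1" "$2"
}

# Start the query and print its execution ID.
run_preview_query() {
  aws athena start-query-execution \
    --query-string "$(preview_query "$1" "$2")" \
    --query-execution-context "Database=$1" \
    --result-configuration "OutputLocation=$3" \
    --query 'QueryExecutionId' --output text
}

# Usage:
# id="$(run_preview_query "$LINK_NAME" "$TABLE" "$RESULTS_LOCATION")"
# Once the query has finished:
# aws athena get-query-results --query-execution-id "$id"
```

If the query returns rows, the cross-account grant and the resource link are both working.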