DEV Community

selvakumar palanisamy

Data sharing using AWS Lake Formation

Lake Formation enables organisations to securely share data between business units, and to scale that sharing without hand-maintaining storage and catalogue policies for every team. The data can be stored in different AWS accounts belonging to different teams.

AWS services which enable data sharing:
1) AWS Glue
2) AWS Lake Formation

AWS Glue

AWS Glue is a managed service that crawls data repositories to help build a data catalogue.
Glue also provides jobs, which are its Extract, Transform, and Load (ETL) tools. You can share access in Glue using role-based access control with IAM roles and policies, but one difficulty is that this requires knowledge of the underlying storage mechanism.
You must also create policies for both the Glue Catalog and the S3 bucket.
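
To illustrate the double bookkeeping described above, here is a minimal sketch of an IAM-only read policy. It assumes the database and bucket names used later in this post; the resource name AnalystReadPolicy is hypothetical and the action list is only a plausible minimum, not a vetted policy.

```yaml
Resources:
  AnalystReadPolicy:            # hypothetical name, for illustration only
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: Illustrative read access to one Glue database and its bucket
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          # Catalogue side: metadata lookups in the Glue Data Catalog
          - Effect: Allow
            Action:
              - 'glue:GetDatabase'
              - 'glue:GetTable'
              - 'glue:GetTables'
              - 'glue:GetPartitions'
            Resource:
              - !Sub 'arn:aws:glue:${AWS::Region}:${AWS::AccountId}:catalog'
              - !Sub 'arn:aws:glue:${AWS::Region}:${AWS::AccountId}:database/my-source-glue-database-demo'
              - !Sub 'arn:aws:glue:${AWS::Region}:${AWS::AccountId}:table/my-source-glue-database-demo/*'
          # Storage side: the reader must also know where the data lives in S3
          - Effect: Allow
            Action: 's3:ListBucket'
            Resource: !Sub 'arn:aws:s3:::my-source-data-store-${AWS::Region}-${AWS::AccountId}'
          - Effect: Allow
            Action: 's3:GetObject'
            Resource: !Sub 'arn:aws:s3:::my-source-data-store-${AWS::Region}-${AWS::AccountId}/*'
```

Every new consumer or new bucket means touching both halves of a policy like this, which is the overhead Lake Formation grants are meant to remove.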

AWS Lake Formation

AWS Lake Formation simplifies access management and resource sharing across accounts. It offers a straightforward granting mechanism that SQL experts will recognise.
These grants can be made to IAM identities, AWS accounts, or an entire AWS Organisation or OU.

When a cross-account grant is created, Lake Formation integrates with AWS Resource Access Manager to create the cross-account resource share.
The shared catalogue resources will then be visible in the local data catalogue to Lake Formation administrators in the target account.
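
As a rough illustration of how a grant reads, the sketch below gives an IAM role DESCRIBE on a database, much like a SQL GRANT statement. The role name analyst-role is an assumption made up for this example; the database name is the one used later in this post.

```yaml
Resources:
  AnalystDatabaseGrant:         # hypothetical name, for illustration only
    Type: AWS::LakeFormation::Permissions
    Properties:
      DataLakePrincipal:
        # Assumed role; replace with a real IAM user or role ARN
        DataLakePrincipalIdentifier: !Sub 'arn:aws:iam::${AWS::AccountId}:role/analyst-role'
      Permissions:
        - DESCRIBE
      Resource:
        DatabaseResource:
          Name: my-source-glue-database-demo
```

Note that the grant names only catalogue objects. The principal does not need to know which bucket backs the database.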

Solution Overview

Share data across AWS accounts to enable a multi-source data analytics solution.

Solution Components

Centralized data lake account

1) Store the data in S3
2) Catalog that data, so that it is visible and its schema is known
3) Share that data with other AWS accounts

Consuming data account

1) Query the data in the source account

The diagram below shows how those components can work together to provide this solution:

[Image: solution architecture diagram]

Setting up Lake Formation

1) Assign a Lake Formation administrator.
The administrator can then manage access to data catalogue resources both within and across accounts.
Lake Formation administrators can be either IAM users or IAM roles.

2) Change the permission model from IAM-only access control to Lake Formation native grants. In the Lake Formation console settings, this means clearing the "Use only IAM access control" defaults for new databases and new tables.
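
The same switch can probably be made in CloudFormation rather than in the console. The sketch below assumes that your version of the AWS::LakeFormation::DataLakeSettings resource supports the CreateDatabaseDefaultPermissions and CreateTableDefaultPermissions properties, and it uses a made-up admin role name; treat it as a starting point rather than a tested template.

```yaml
Resources:
  LakeformationSettings:
    Type: AWS::LakeFormation::DataLakeSettings
    Properties:
      Admins:
        # Hypothetical admin role; use your own IAM user or role ARN
        - DataLakePrincipalIdentifier: !Sub 'arn:aws:iam::${AWS::AccountId}:role/my-lake-admin-role'
      # Empty lists drop the IAMAllowedPrincipals default, so databases and
      # tables created from now on rely on Lake Formation grants
      CreateDatabaseDefaultPermissions: []
      CreateTableDefaultPermissions: []
```

These defaults only affect databases and tables created afterwards. Existing catalogue objects keep whatever permissions they already have.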

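3) Establish the centralized data lake: an S3 bucket to hold the data and a Glue database to catalogue it.
4) Once the bucket exists, upload the data files using the AWS CLI or the S3 console. With the CLI, substitute your own region and account ID in the bucket name:

aws s3 sync . s3://my-source-data-store-<region>-<account-id>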

5) Add the crawler and give it permission to read the bucket and write to the catalog.
Use this CloudFormation stack to deploy the resources mentioned in the steps above:

AWSTemplateFormatVersion: '2010-09-09'
Description: My data lake source

Resources:
  LakeformationSettings:
    Type: AWS::LakeFormation::DataLakeSettings
    Properties:
      Admins:
        - DataLakePrincipalIdentifier: arn:aws:iam::XXXXXXXXX:role/aws-reserved/sso.amazonaws.com/ap-southeast-2/AWSReservedSSO_AWSAdministratorAccess_85c5426c350156b8
  MySourceDataStore:
    Type: AWS::S3::Bucket
    DeletionPolicy: Delete
    Properties:
      AccessControl: Private
      BucketName: !Sub 'my-source-data-store-${AWS::Region}-${AWS::AccountId}'
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256

  MySourceGlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: my-source-glue-database-demo
        Description: Source database for the data lake demo
  MySourceCrawlerRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: 'Allow'
            Principal:
              Service:
                - 'glue.amazonaws.com'
            Action:
              - 'sts:AssumeRole'
      Path: '/'
      Policies:
        - PolicyName: 'root'
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 'glue:*'
                Resource: '*'
              - Effect: Allow
                Action:
                  - 'logs:CreateLogGroup'
                  - 'logs:CreateLogStream'
                  - 'logs:PutLogEvents'
                  - 'logs:AssociateKmsKey'
                Resource: '*'
              - Effect: Allow
                Action: 's3:ListBucket'
                Resource: !GetAtt MySourceDataStore.Arn
              - Effect: Allow
                Action: 's3:GetObject'
                Resource: !Sub
                  - '${Bucket}/*'
                  - { Bucket: !GetAtt MySourceDataStore.Arn }

  MySourceCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: my-source-data-crawler
      Role: !GetAtt MySourceCrawlerRole.Arn
      DatabaseName: !Ref MySourceGlueDatabase
      Targets:
        S3Targets:
          - Path: !Sub 's3://${MySourceDataStore}'
      SchemaChangePolicy:
        UpdateBehavior: 'UPDATE_IN_DATABASE'
        DeleteBehavior: 'LOG'

  SourceCrawlerLakeGrants:
    Type: AWS::LakeFormation::Permissions
    Properties:
      DataLakePrincipal:
        DataLakePrincipalIdentifier: !GetAtt MySourceCrawlerRole.Arn
      Permissions:
        - ALTER
        - DROP
        - CREATE_TABLE
      Resource:
        DatabaseResource:
          Name: !Ref MySourceGlueDatabase

  DatalakeLocation:
    Type: AWS::LakeFormation::Resource
    Properties:
      ResourceArn: !GetAtt MySourceDataStore.Arn
      RoleArn: !Sub arn:aws:iam::${AWS::AccountId}:role/aws-service-role/lakeformation.amazonaws.com/AWSServiceRoleForLakeFormationDataAccess
      UseServiceLinkedRole: true

6) Once the stack is deployed, you should be able to see the crawler in the Glue console. Run it there so that the table is created in the catalogue.

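7) Cross-account grant
To enable cross-account access, add a Lake Formation grant to the stack's Resources and specify the consumer account ID. The table named in the grant is the one the crawler creates, so run the crawler before deploying this resource.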

  CrossAccountLakeGrants:
    Type: AWS::LakeFormation::Permissions
    Properties:
      DataLakePrincipal:
        DataLakePrincipalIdentifier: "XXXXXXXXXX" # Consumer account number
      Permissions:
        - SELECT
      PermissionsWithGrantOption:
        - SELECT
      Resource:
        TableResource:
          DatabaseName: !Ref MySourceGlueDatabase
          Name: !Sub 'my_source_data_store_ap_southeast_2_${AWS::AccountId}'

8) Set up permissions in the consumer account

Log in to the consumer account and set up the Lake Formation base settings:

1) Set up a Lake Formation administrator

AWSTemplateFormatVersion: '2010-09-09'
Description: My consumer data lake setup

Resources:
  LakeformationSettings:
    Type: AWS::LakeFormation::DataLakeSettings
    Properties: 
      Admins: 
        - DataLakePrincipalIdentifier: arn:aws:iam::XXXXXXXXXX:role/aws-reserved/sso.amazonaws.com/AWSReservedSSO_AWSAdministratorAccess_56cabj890003333

2) Turn on Lake Formation grants, as was done in the data lake account:

Create a resource link to the database in the data lake account. At the time of writing, this was not available via CloudFormation, so in the Lake Formation console go to Databases, click the Create database button, and choose the resource link option.
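
CloudFormation has since documented a TargetDatabase property on AWS::Glue::Database, which looks like it can express the same resource link. The sketch below assumes that property is available to you. The ProducerAccountId parameter and the local link name are hypothetical names introduced for this example.

```yaml
Parameters:
  ProducerAccountId:            # assumed parameter: the data lake account ID
    Type: String

Resources:
  SharedDatabaseLink:           # hypothetical name, for illustration only
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        # Local name under which the shared database appears in this account
        Name: my-source-glue-database-link
        TargetDatabase:
          # Points at the database in the data lake account's catalogue
          CatalogId: !Ref ProducerAccountId
          DatabaseName: my-source-glue-database-demo
```

The link holds no data of its own. It only lets the consumer account address the shared database under a local name.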

3) Open the Athena console. You should be able to see your linked database and the table schema. All that remains is to query the table and make sure it returns results.
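
If you would like the check query to live alongside the rest of the setup, a saved Athena query is one option. This is only a sketch: the ProducerAccountId parameter and the database name my-source-glue-database-link are assumptions standing in for your data lake account ID and whatever you called the resource link, and the table name follows the pattern used in the cross-account grant.

```yaml
Parameters:
  ProducerAccountId:            # assumed parameter: the data lake account ID
    Type: String

Resources:
  SampleSharedTableQuery:       # hypothetical name, for illustration only
    Type: AWS::Athena::NamedQuery
    Properties:
      Name: sample-shared-table-query
      Description: Reads a few rows from the shared table through the resource link
      # Assumed name of the resource link created in the previous step
      Database: my-source-glue-database-link
      QueryString: !Sub 'SELECT * FROM "my_source_data_store_ap_southeast_2_${ProducerAccountId}" LIMIT 10'
```

The saved query then appears in the Athena console, where running it should return a handful of rows if the grant and the resource link are in place.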
