Sidra Saleem for SUDO Consultants

Posted on May 31 • Originally published at sudoconsultants.com

AWS DataSync: Accelerating Data Transfer for Software and Hardware Teams

#datatransfer #datasync #aws #cloud

Introduction

Overview of AWS DataSync

AWS DataSync is a secure online service that accelerates moving data between on-premises storage and Amazon Web Services, or between AWS services. DataSync works with many data sources, including NFS shares, SMB shares, HDFS, self-managed object storage, AWS Snowcone, Amazon S3 buckets, Amazon EFS file systems, and several Amazon FSx file systems. It also supports transfers between AWS and other public clouds, which makes it easier to replicate, archive, or share application data.

Importance of Efficient Data Transfer in Software and Hardware Teams

Data is often called the new oil, and moving it efficiently matters for several reasons:

Speed and efficiency: Keeping data current with low latency speeds up the applications and services that depend on it. DataSync also automates and accelerates the transfer process, which reduces the manual effort of migration and synchronization and saves the team both time and money.
Reliability and security: AWS DataSync includes built-in integrity checks and security measures, which reduce the risk of data loss or a breach during transfer.
Scalability: AWS DataSync scales on demand without sacrificing performance, so a business can grow both its data volume and its geographic footprint.
Flexibility: AWS DataSync supports a wide range of sources and destinations, so it can adapt as your business needs or technology stack change.

Together, these benefits make AWS DataSync a valuable tool for software and hardware teams that want optimized data workflows, efficient operations, and data that stays available and intact.

Understanding AWS DataSync

How AWS DataSync Works

AWS DataSync is an online data movement and discovery service that lets you quickly, easily, and securely transfer file or object data to, from, and between AWS storage services. It uses a purpose-built network protocol and a parallel, multi-threaded architecture to accelerate transfers, which suits migrations, recurring data processing workflows for analytics and machine learning, and data protection. DataSync also provides end-to-end encryption and integrity validation, so your data arrives securely, intact, and ready to use. It accesses AWS storage through built-in AWS security mechanisms such as AWS Identity and Access Management (IAM) roles, and it supports virtual private cloud (VPC) endpoints so that data can move between sites without traversing the public internet.

Components and Terms Used in DataSync

DataSync Resources:

The main DataSync resources are agents, locations, and tasks. An agent is installed on an on-premises server or virtual machine and carries out the data transfer. A location represents the source or destination of a transfer, which can be on-premises storage or an AWS storage service. A task is a configuration that defines the details of a transfer, for example the source and destination locations, the data to transfer, and the transfer options.

IAM Policies:

Access to DataSync resources is managed with identity-based policies (IAM policies). A policy can allow or deny permission to create and manage DataSync resources, that is, agents, locations, and tasks. A policy can also grant permissions to a role in another AWS account or to an AWS service, which covers cross-account access and integration with other AWS services.

API Operations:

DataSync defines a set of API operations for managing these resources, including operations to create, delete, and describe tasks. IAM policies grant applications permission to call these operations by specifying actions, effects, resources, and principals.

IAM Permissions Required for DataSync

To manage DataSync resources effectively, the following IAM permissions are required:

Create and Manage DataSync Resources:

An IAM policy can permit an IAM role in your AWS account to create and manage DataSync resources (agents, locations, and tasks). This covers creating, updating, and deleting those resources.

Cross-account access:

To grant permissions to a role that belongs to another AWS account or to an AWS service, write an IAM policy that spells out which resources in your account the role may access, and attach a trust policy to the role. The trust policy names the principal (the other AWS account or AWS service) that is allowed to assume the role. Users in that account, or the service, can then access or create resources in your account.
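As a rough illustration of the trust policy described above, the sketch below writes one that lets the DataSync service itself assume a role, which is the usual setup for a bucket access role. The function name write_trust_policy, the output file name, and the role name in the usage comment are made up for this example.

```shell
# Write a trust policy that lets the DataSync service assume a role.
#   $1  path of the JSON file to create (a placeholder you choose)
write_trust_policy() {
  cat > "$1" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "datasync.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

# Example usage (the role name is hypothetical):
#   write_trust_policy trust-policy.json
#   aws iam create-role --role-name myBucketAccessRole \
#     --assume-role-policy-document file://trust-policy.json
```

For cross-account access, the Principal element would instead name the other account, for example an "AWS" key holding that account's ARN.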

Specifying elements of the policy:

When defining IAM policies for DataSync, you specify the actions (for example, datasync:CreateTask), the effect (Allow or Deny), and the resources, identified by Amazon Resource Names (ARNs). In an identity-based policy, the principal is implicitly the user or role that the policy is attached to. DataSync supports only identity-based policies (IAM policies) and does not support resource-based policies.
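To make the action, effect, and resource elements concrete, here is a minimal sketch of an identity-based policy that allows a handful of DataSync task operations. The selection of actions, the wildcard resource, and all names (write_datasync_policy, the file, role, and policy names) are assumptions chosen for illustration; a real policy would normally be scoped to specific ARNs.

```shell
# Write an identity-based policy that allows a few DataSync task actions.
#   $1  path of the JSON file to create (a placeholder you choose)
write_datasync_policy() {
  cat > "$1" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "datasync:CreateTask",
        "datasync:DescribeTask",
        "datasync:StartTaskExecution",
        "datasync:DescribeTaskExecution"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}

# Example usage (role and policy names are hypothetical):
#   write_datasync_policy datasync-policy.json
#   aws iam put-role-policy --role-name myDataSyncOperator \
#     --policy-name DataSyncTaskAccess \
#     --policy-document file://datasync-policy.json
```

Note that the policy has no Principal element: because it is identity-based, it applies to whichever user or role it is attached to.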

Understanding these components and permissions puts you in a position to manage and operate AWS DataSync efficiently.

Setting Up AWS DataSync

Installation and Setup of AWS CLI

Before you can use AWS DataSync from the command line, you need to install and configure the AWS Command Line Interface (CLI). The AWS CLI is a unified tool for managing your AWS services from a terminal session on your own client. With one tool to download and set up, you can control multiple AWS services from the command line and automate them through scripts.

Installation: Follow the instructions at the AWS CLI User Guide to install the AWS CLI on your system.
Configuration: Once installed, configure the AWS CLI with your AWS credentials by running aws configure and inputting your AWS Access Key ID, Secret Access Key, default region name and output format.
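Once aws configure has been run, one way to confirm that the credentials actually work is to ask AWS STS which account they resolve to. This is a small sketch that assumes the AWS CLI is on the PATH; the function name current_account_id is made up for this example.

```shell
# Print the AWS account ID that the configured credentials resolve to.
# A non-zero exit status means the CLI could not authenticate.
current_account_id() {
  aws sts get-caller-identity --query Account --output text
}

# Example usage:
#   current_account_id
```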

Creating an AWS DataSync Agent

An AWS DataSync agent is a software component that you deploy in your storage environment to transfer data. Once activated, the agent is associated with your AWS account.

Deployment: Deploy the DataSync agent on your on-premises server or virtual machine as per the instructions given in the DataSync User Guide.
Activation: Activate the agent using the AWS CLI and the create-agent command. This step registers the agent with your AWS account.

Creating AWS DataSync Locations

Locations in AWS DataSync identify the source and destination of a data transfer. Every task needs a source location and a destination location.

Creating Source Location: Use the create-location-s3 command for an Amazon S3 bucket, create-location-fsx-windows for an Amazon FSx for Windows File Server, or create-location-hdfs for an HDFS cluster. For example, to create an S3 location, you might use a command like:

aws datasync create-location-s3 --s3-bucket-arn arn:aws:s3:::mybucket --s3-config BucketAccessRoleArn=arn:aws:iam::123456789012:role/myBucketAccessRole

Creating Destination Location: Similarly, create a destination location using the appropriate command for the destination service.

Creating an AWS DataSync Task

A task in AWS DataSync defines the details of a data transfer, including the source and destination locations, the data to be transferred, and the transfer options.

Creating a Task: Use the create-task command to create a task. Specify the source and destination locations, the options for the transfer, and any filters for the data to be transferred.

Starting an AWS DataSync Task

Once a task is created, you can start it using the start-task-execution command. This initiates the data transfer according to the task's configuration.

Starting a Task: Use the start-task-execution command with the task ARN to start the task. For example:

aws datasync start-task-execution --task-arn 'arn:aws:datasync:region:account-id:task/task-id'

Filtering AWS DataSync Resources

AWS DataSync lets you filter the data that is transferred by specifying include and exclude filters in your task configuration. Filters are patterns that match file, folder, or object paths.

Specifying Filters: When creating or updating a task, you can specify filters using the --includes and --excludes options of the create-task or update-task command. For example, to transfer only files in a specific directory, you might use:

aws datasync create-task --source-location-arn 'arn:aws:datasync:region:account-id:location/location-id' --destination-location-arn 'arn:aws:datasync:region:account-id:location/location-id' --includes FilterType=SIMPLE_PATTERN,Value=/path/to/directory

By following these steps, you can set up AWS DataSync to efficiently transfer data between your on-premises environments and AWS storage services, or between different AWS storage services.

Using AWS DataSync with the CLI

Detailed Steps for Creating an AWS DataSync Agent Using the AWS CLI

Deploy the DataSync Agent: First, deploy the DataSync agent on your on-premises server or virtual machine. This involves downloading the agent image from AWS and configuring it for your environment.
Activate the Agent: Once deployed, obtain an activation key from the agent, then activate the agent using the AWS CLI with the create-agent command. This associates the agent with your AWS account.

aws datasync create-agent --activation-key ACTIVATION-KEY --agent-name MyAgent --vpc-endpoint-id vpce-0abc123defgh5678 --subnet-arns arn:aws:ec2:us-west-2:123456789012:subnet/subnet-0abc123defgh5678 --security-group-arns arn:aws:ec2:us-west-2:123456789012:security-group/sg-0abc123defgh5678

Replace the placeholders (activation key, VPC endpoint ID, subnet ARN, and security group ARN) with your own values. The VPC endpoint, subnet, and security group options apply only when the agent connects through a VPC endpoint.

Steps for Creating AWS DataSync Locations with the AWS CLI

Create a Source Location: Use the create-location-s3 command for an Amazon S3 bucket. For an Amazon FSx for Windows File Server, use create-location-fsx-windows. For an HDFS cluster, use create-location-hdfs. Let's see how to create an S3 location:

aws datasync create-location-s3 --s3-bucket-arn arn:aws:s3:::mybucket --s3-config BucketAccessRoleArn=arn:aws:iam::123456789012:role/myBucketAccessRole

Create a Destination Location: Create the destination location similarly, using the appropriate command for the destination service.
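As an example of a location on the other side of the transfer, the sketch below registers an on-premises NFS share and prints the ARN of the new location so that it can be passed to create-task. It assumes an agent has already been activated; the function name create_nfs_location and the argument values in the usage comment are placeholders invented for this example.

```shell
# Register an NFS share as a DataSync location and print its ARN.
# Arguments (all values you supply):
#   $1  NFS server hostname or IP address
#   $2  exported path on the server, for example /exports/data
#   $3  ARN of an activated DataSync agent that can reach the server
create_nfs_location() {
  aws datasync create-location-nfs \
    --server-hostname "$1" \
    --subdirectory "$2" \
    --on-prem-config "AgentArns=$3" \
    --query LocationArn \
    --output text
}

# Example usage (hostname and AGENT_ARN are placeholders):
#   NFS_LOCATION_ARN=$(create_nfs_location nfs.example.com /exports/data "$AGENT_ARN")
```

The --query and --output options are general AWS CLI options, so the same pattern can capture the LocationArn from create-location-s3 and the other create-location commands.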

Steps for Creating an AWS DataSync Task with the AWS CLI

Create a Task: Use the create-task command to create a task. Specify the source and destination locations, the options for the transfer, and any filters for the data to be transferred.

aws datasync create-task \
    --source-location-arn 'arn:aws:datasync:us-east-1:account-id:location/location-id' \
    --destination-location-arn 'arn:aws:datasync:us-east-2:account-id:location/location-id' \
    --cloud-watch-log-group-arn 'arn:aws:logs:region:account-id:log-group:log-group' \
    --name task-name \
    --options VerifyMode=NONE,OverwriteMode=NEVER,Atime=BEST_EFFORT,Mtime=PRESERVE,Uid=INT_VALUE,Gid=INT_VALUE,PreserveDevices=PRESERVE,PosixPermissions=PRESERVE,PreserveDeletedFiles=PRESERVE,TaskQueueing=ENABLED,LogLevel=TRANSFER

Replace the placeholders with your actual ARNs and desired task options.

Steps for Starting an AWS DataSync Task with the AWS CLI

Start a Task: Use the start-task-execution command with the task ARN to start the task.

aws datasync start-task-execution --task-arn 'arn:aws:datasync:region:account-id:task/task-id'
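When scripting a transfer, it helps to keep the ARN of the execution that start-task-execution returns, because the monitoring commands need it. A minimal sketch, in which the function name start_execution and the TASK_ARN variable are assumptions for this example:

```shell
# Start a task execution and print the ARN of the new execution.
#   $1  ARN of an existing DataSync task
start_execution() {
  aws datasync start-task-execution \
    --task-arn "$1" \
    --query TaskExecutionArn \
    --output text
}

# Example usage (TASK_ARN is a placeholder you set yourself):
#   EXECUTION_ARN=$(start_execution "$TASK_ARN")
#   echo "Started: $EXECUTION_ARN"
```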

Filtering AWS DataSync Resources Using the CLI

Specify Filters: You can specify filters while creating or updating a task using the --includes and --excludes options of the create-task or update-task command. For example, you could write the following to transfer only the files under /path/to/directory:

aws datasync create-task --source-location-arn 'arn:aws:datasync:region:account-id:location/location-id' --destination-location-arn 'arn:aws:datasync:region:account-id:location/location-id' --includes FilterType=SIMPLE_PATTERN,Value=/path/to/directory
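A DataSync filter rule takes a single string in which several patterns are separated by a pipe character. The sketch below joins patterns into that form and applies them as an exclude filter on an existing task. The function names join_patterns and set_task_excludes, and the patterns in the usage comment, are hypothetical.

```shell
# Join any number of patterns into one pipe-delimited filter string.
join_patterns() {
  # Run in a subshell so the IFS change does not leak to the caller.
  ( IFS='|'; printf '%s\n' "$*" )
}

# Replace the exclude filter of an existing task.
#   $1      task ARN
#   $2...   one or more patterns to exclude
set_task_excludes() {
  task_arn=$1
  shift
  aws datasync update-task \
    --task-arn "$task_arn" \
    --excludes "FilterType=SIMPLE_PATTERN,Value=$(join_patterns "$@")"
}

# Example usage (TASK_ARN and the patterns are placeholders):
#   set_task_excludes "$TASK_ARN" '*.tmp' '/logs'
```

Quoting the patterns matters here, since an unquoted * would be expanded by the shell before the AWS CLI sees it.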

By following these steps, you can effectively use AWS DataSync with the AWS CLI to manage your data transfer tasks, including setting up agents, creating locations, defining tasks, and filtering resources.

Monitoring AWS DataSync Tasks

Describing Task Execution Using the AWS CLI

To monitor the progress of your AWS DataSync task in real-time from the command line, you can use the describe-task-execution command. This command provides detailed information about the task execution, including the current status, the amount of data transferred, and any errors encountered.

aws datasync describe-task-execution \
  --task-execution-arn 'arn:aws:datasync:region:account-id:task/task-id/execution/task-execution-id'

Replace 'arn:aws:datasync:region:account-id:task/task-id/execution/task-execution-id' with the actual ARN of your task execution.

Monitoring the Progress of an Ongoing Transfer

To monitor the progress of an ongoing transfer, you can use the watch utility in conjunction with the describe-task-execution command. This allows you to see the task execution details updated in real-time.

watch -n 1 -d "aws datasync describe-task-execution --task-execution-arn 'arn:aws:datasync:region:account-id:task/task-id/execution/task-execution-id'"

This command updates every second (-n 1) and highlights differences (-d), providing a live view of the task execution progress.
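In a script, where nobody is watching the terminal, a polling loop that stops at a terminal status can be more practical than watch. This is a sketch under the assumption that SUCCESS and ERROR are the statuses you want to stop on; the function name wait_for_execution and the default 30-second interval are choices made for this example.

```shell
# Poll a task execution until it reaches a terminal status.
#   $1  task execution ARN
#   $2  seconds to sleep between polls (optional, defaults to 30)
# Returns 0 on SUCCESS, 1 on ERROR, and 2 if the CLI call itself fails.
wait_for_execution() {
  interval=${2:-30}
  while :; do
    status=$(aws datasync describe-task-execution \
      --task-execution-arn "$1" \
      --query Status \
      --output text) || return 2
    echo "status: $status"
    case "$status" in
      SUCCESS) return 0 ;;
      ERROR)   return 1 ;;
    esac
    sleep "$interval"
  done
}

# Example usage (EXECUTION_ARN is a placeholder):
#   wait_for_execution "$EXECUTION_ARN" 60 && echo "transfer finished"
```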

Checking the Results of a Transfer

After the task execution completes, run the describe-task-execution command again to check the results of the transfer. Look at the Status field in the response: a successful execution ends with Status set to SUCCESS, while a failed execution ends with Status set to ERROR and includes error codes that can help you troubleshoot.

aws datasync describe-task-execution \
  --task-execution-arn 'arn:aws:datasync:region:account-id:task/task-id/execution/task-execution-id'
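The full response is fairly long, so for a quick check you might narrow it to the status and the error fields with a JMESPath query. A small sketch, assuming the error information sits under the Result element of the response; the function name execution_result is made up for this example.

```shell
# Print only the status and error fields of a task execution as JSON.
#   $1  task execution ARN
execution_result() {
  aws datasync describe-task-execution \
    --task-execution-arn "$1" \
    --query '{Status: Status, ErrorCode: Result.ErrorCode, ErrorDetail: Result.ErrorDetail}' \
    --output json
}

# Example usage (EXECUTION_ARN is a placeholder):
#   execution_result "$EXECUTION_ARN"
```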

Additional Monitoring Tools

Amazon CloudWatch: For deeper monitoring, Amazon CloudWatch collects the raw data that DataSync produces and processes it into readable, near real-time metrics. These statistics are retained for 15 months. By default, DataSync sends metrics to CloudWatch automatically at 5-minute intervals.
Task Statuses: You can also track a DataSync task through its status. The most common statuses are AVAILABLE, RUNNING, UNAVAILABLE, and QUEUED, each of which represents a different stage of the task lifecycle.
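Task status is separate from the status of an individual execution, and it can be read with describe-task. The sketch below checks whether a task is currently AVAILABLE; the function names task_status and task_is_available are made up for this example.

```shell
# Print the current status of a task (for example AVAILABLE or RUNNING).
#   $1  task ARN
task_status() {
  aws datasync describe-task --task-arn "$1" --query Status --output text
}

# Succeed only when the task reports AVAILABLE.
task_is_available() {
  [ "$(task_status "$1")" = "AVAILABLE" ]
}

# Example usage (TASK_ARN is a placeholder):
#   task_is_available "$TASK_ARN" || echo "task is not available right now"
```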

With the commands and tools above, you can monitor your AWS DataSync tasks, confirm that transfers are proceeding as expected, and respond promptly when problems occur.

Advanced Configuration Options for AWS DataSync

Configuring Task Options

When you create an AWS DataSync task, you have many options for how the task handles files, objects, and their metadata during a transfer. These options include:

Handling Files: You can transfer only the data that has changed, or transfer all files without comparing the source and destination. This choice affects transfer speed and efficiency.
Object Metadata: AWS DataSync preserves POSIX permissions for files and folders, as well as object tags and access control lists (ACLs), during the transfer, so metadata stays consistent between source and destination.
Verify Data Integrity: DataSync checks integrity during the transfer by comparing the data written to the destination with the data read from the source. Optionally, it can run a verification pass at the end of the transfer that compares full-file checksums between source and destination.
Bandwidth Throttle: A built-in bandwidth throttle limits the network bandwidth that AWS DataSync uses, so that ongoing transfers do not affect other users or applications on the same network connection.
Logging: Configure the type of logs published by DataSync to an Amazon CloudWatch Logs log group.

Specifying How DataSync Checks the Integrity of Data During a Transfer

AWS DataSync performs integrity checks during a transfer, comparing the data written to the destination with the data read from the source. Optionally, it can run a verification check at the end of the transfer that computes and compares full-file checksums of the data stored on the source and on the destination, so you can confirm that the transfer succeeded.
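On the command line, this choice is expressed through the VerifyMode task option. The sketch below maps three plain-language choices to the corresponding VerifyMode values and applies one to an existing task. The keywords all, transferred, and none, and both function names, are invented for this example; only the VerifyMode values themselves come from DataSync.

```shell
# Map a plain-language verification choice to a DataSync VerifyMode value.
verify_mode_for() {
  case "$1" in
    all)         echo POINT_IN_TIME_CONSISTENT ;;  # verify everything at the end
    transferred) echo ONLY_FILES_TRANSFERRED ;;    # verify only what was copied
    none)        echo NONE ;;                      # skip the end-of-transfer check
    *) echo "unknown choice: $1" >&2; return 1 ;;
  esac
}

# Update the verification mode of an existing task.
#   $1  task ARN
#   $2  one of: all, transferred, none
set_verify_mode() {
  mode=$(verify_mode_for "$2") || return 1
  aws datasync update-task --task-arn "$1" --options "VerifyMode=$mode"
}

# Example usage (TASK_ARN is a placeholder):
#   set_verify_mode "$TASK_ARN" transferred
```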

Adjusting Log Levels for Individual Task Executions

When you configure a DataSync task, you can specify the level of detail in the logs that DataSync publishes to an Amazon CloudWatch Logs log group:

BASIC: Publishes basic log messages, such as transfer errors.
TRANSFER: Publishes a log message for every file or object that the task execution transfers, along with the data-integrity checks that were performed.
OFF: No logs are published.

By setting the log level, you control how granular the logged information is for a task execution, which lets you focus on the details relevant to your monitoring and troubleshooting.
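To change the log level for a single run without editing the task, start-task-execution accepts an --override-options argument. A minimal sketch, in which the function name and the TASK_ARN variable are assumptions for this example:

```shell
# Start one execution with a log level that differs from the task default.
#   $1  task ARN
#   $2  log level: OFF, BASIC, or TRANSFER
start_execution_with_log_level() {
  case "$2" in
    OFF|BASIC|TRANSFER) ;;
    *) echo "log level must be OFF, BASIC, or TRANSFER" >&2; return 1 ;;
  esac
  aws datasync start-task-execution \
    --task-arn "$1" \
    --override-options "LogLevel=$2" \
    --query TaskExecutionArn \
    --output text
}

# Example usage (TASK_ARN is a placeholder):
#   start_execution_with_log_level "$TASK_ARN" TRANSFER
```

The override applies only to the execution it starts; the options stored on the task stay as they were.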

Best Practices and Considerations

Storage Class Considerations with Amazon S3 Locations

Cost Efficiency: To avoid the minimum capacity charge per object, AWS DataSync automatically stores small objects in S3 Standard. To reduce data retrieval fees, configure DataSync to verify only the files that a given task transfers. DataSync also provides controls for overwriting and deleting objects, which help you avoid minimum storage duration charges.
Performance and Cost: When using DataSync with S3, account for S3 request charges, data retrieval fees, and storage duration charges. Choose the S3 storage class that fits the data's usage patterns and your cost targets.

Bandwidth Usage Limits for DataSync Tasks

Bandwidth Throttle: The built-in bandwidth throttle lets you cap the network bandwidth that AWS DataSync consumes. This limits the impact of transfers on other users and on applications that rely on the same network connection.
Performance Impact: A bandwidth throttle on a task also limits the I/O against your source storage system. Without one, a transfer can affect response times for other clients that access the same source data store.
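The throttle is expressed in bytes per second through the BytesPerSecond option, and it can be adjusted on an execution that is already running. The sketch below converts a limit given in MiB per second and applies it; the function names mib_to_bytes and throttle_execution are made up for this example.

```shell
# Convert a whole number of MiB into bytes.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# Change the bandwidth limit of a running task execution.
#   $1  task execution ARN
#   $2  limit in MiB per second (whole number)
throttle_execution() {
  aws datasync update-task-execution \
    --task-execution-arn "$1" \
    --options "BytesPerSecond=$(mib_to_bytes "$2")"
}

# Example usage (EXECUTION_ARN is a placeholder): limit to 10 MiB/s
#   throttle_execution "$EXECUTION_ARN" 10
```

The same BytesPerSecond key can be set in the --options of create-task to make the limit part of the task configuration.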

Queueing Transfer Tasks During Multiple Task Executions

Task Queueing: When you run multiple task executions, DataSync can place them in a queue, which is what the TaskQueueing=ENABLED option in the create-task example controls.
Task Reports: DataSync task reports provide summary and detailed reports, as JSON-formatted output files, of all files transferred, skipped, verified, and deleted across task executions. This lets you verify and audit the data transfer operations for each task run.
Monitoring and Auditing: Use AWS Glue, Amazon Athena, and Amazon QuickSight to automatically catalog, analyze, and visualize task report output. This makes tracking and auditing easier and helps you spot common task execution trends or failure patterns.

Choosing the Type of Logs Published to Amazon CloudWatch Logs

Log Levels: Specify the log types that are published to an Amazon CloudWatch Logs log group by DataSync. You can choose between:
- BASIC: The type of log that includes transfer errors.
- TRANSFER: The type of log that includes all information that is provided by BASIC plus detailed descriptions of all files that are transferred and all integrity-check information.
- OFF: No logs are published.
Monitoring and Troubleshooting: CloudWatch Logs gives you detailed information about the files transferred at a point in time and the results of DataSync integrity verification. This simplifies monitoring, reporting, and troubleshooting, and lets you give stakeholders timely updates.

Conclusion

AWS DataSync is valuable to software and hardware teams because it provides a secure, efficient, and scalable way to transfer data between on-premises environments and AWS storage services, or between AWS storage services. Its advanced configuration options cover file handling, object metadata preservation, data integrity checks, bandwidth throttling, and detailed logging, so teams can tune data transfers to their needs for performance, cost, and reliability.

Future Developments and Improvements in AWS DataSync

This article does not cover specific planned developments for AWS DataSync, but AWS continues to invest in improving its services to meet customers' emerging needs. Efficient data transfer is a key goal for modern IT infrastructure, and one can hope that AWS DataSync keeps gaining updates that improve its performance and scalability and deepen its integration with other AWS services and third-party solutions.