At the risk of upsetting Jeff Bezos, I recently moved a few million PDF files from Amazon S3 to Azure Storage (Blob Storage, in fact). I kept it simple and opted to use Microsoft's AzCopy tool. It's a command-line tool that allows you to copy blobs or files from or to an Azure Storage account. AzCopy also integrates with the Azure Storage Explorer client application, but using a UI wasn't ideal with the number of files I had.
In this post, I'd like to show an authorization "gotcha" to keep in mind and a few things I learned that might help you.
After you install AzCopy—the installation is a .zip or .tar file depending on your environment—you'll need to let AzCopy know that you are authorized to access the Amazon S3 and Azure Storage resources.
From the Azure side, you can authorize with Azure Active Directory (AD) or a SAS token. I ran azcopy login to log in to Azure with my credentials. If you want to run AzCopy inside a script or have more advanced use cases, it's a good idea to authorize with a managed identity.
With ownership access to the storage account, I thought that was all I needed. Not so fast!
You will also need one of these roles assigned to your Azure AD identity:
- Storage Blob Data Reader (downloads only)
- Storage Blob Data Contributor
- Storage Blob Data Owner
Note: Even if you are a Storage Account Owner, you still need one of those permissions.
From the AWS side, you'll need to grab an AWS Access Key ID and an AWS Secret Access Key. If you're not sure how to retrieve those, check out the AWS docs.
From there, it's as easy as setting two environment variables (I'm using Windows):
set AWS_ACCESS_KEY_ID=<access-key>
set AWS_SECRET_ACCESS_KEY=<secret-access-key>
I needed to copy all the files from a public AWS directory with the pattern /my-bucket/dir/dir/dir/dir/ to a public Azure Storage container. To do that, I called azcopy like so:
azcopy copy "https://s3.amazonaws.com/my-bucket/dir/dir/dir/dir/*" "https://mystorageaccount.blob.core.windows.net/mycontainer" --recursive=true
This command allowed me to take anything under the directory while also keeping the file structure from the S3 bucket. I knew that they were all PDF files, but I could have also used the --include-pattern flag like this:
azcopy copy "https://s3.amazonaws.com/my-bucket/dir/dir/dir/dir/*" "https://mystorageaccount.blob.core.windows.net/mycontainer" --include-pattern "*.pdf" --recursive=true
There's a lot of flexibility here. You can specify multiple complete file names, use wildcard characters (I could have set multiple file types here), and even filter based on file modified dates. I might need to be more selective in the future, so I was happy to see all the options at my disposal.
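If you'd rather drive the copy from a script, here is a minimal Python sketch of the same command. The helper names (s3_environment, build_copy_command, run_copy) are my own, the URLs are the placeholders from this post, and it assumes azcopy is on your PATH. The flags are ones AzCopy documents: --include-pattern takes a semicolon-separated list and --include-after takes an ISO 8601 timestamp. It's worth confirming in the docs that a given filter applies to an S3 source.

```python
import os
import subprocess


def s3_environment(access_key, secret_key, base_env=None):
    """Return a copy of the environment with the AWS credentials AzCopy reads."""
    env = dict(os.environ if base_env is None else base_env)
    env["AWS_ACCESS_KEY_ID"] = access_key
    env["AWS_SECRET_ACCESS_KEY"] = secret_key
    return env


def build_copy_command(source, destination, patterns=None, modified_after=None):
    """Build the argument list for `azcopy copy` (hypothetical helper)."""
    command = ["azcopy", "copy", source, destination, "--recursive=true"]
    if patterns:
        # AzCopy expects multiple patterns separated by semicolons.
        command += ["--include-pattern", ";".join(patterns)]
    if modified_after:
        # ISO 8601 timestamp, for example "2021-11-01T00:00:00Z".
        command += ["--include-after", modified_after]
    return command


def run_copy(command, env):
    """Run AzCopy and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(command, env=env, check=True)
```

Passing patterns=["*.pdf", "*.docx"] produces a single --include-pattern argument of *.pdf;*.docx. Building the argument list separately from running it makes the command easy to print and inspect before you commit to a multi-hour transfer.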
If AzCopy runs for a while, you might end up with a stopped job. It could be because of failures or a system reboot. To start where you left off, you can run azcopy jobs list to get a list of your jobs in this format:
Job Id: <some-guid>
Start Time: <when-the-job-started>
Status: Cancelled | Completed | Failed
Command: copy "source" "destination" --any-flags
With the correct job ID in hand, I could run the following command to pick up where I left off:
azcopy jobs resume <job-id>
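With many jobs in the list, picking out the unfinished ones by hand gets tedious. The following is a rough Python sketch that parses the listing and builds a resume command for each stopped job. It assumes the listing prints one field per line in the order shown in this post; the function names are hypothetical, and the exact layout may differ between AzCopy versions.

```python
import re

# One job entry: four labelled fields, one per line, in the order shown above.
JOB_PATTERN = re.compile(
    r"Job\s?Id:\s*(?P<job_id>\S+)\s+"
    r"Start Time:\s*(?P<start>.*?)\s+"
    r"Status:\s*(?P<status>\w+)\s+"
    r"Command:\s*(?P<command>.*)"
)


def parse_jobs(listing):
    """Parse `azcopy jobs list` text into a list of dicts."""
    return [match.groupdict() for match in JOB_PATTERN.finditer(listing)]


def resume_commands(listing, statuses=("Cancelled", "Failed")):
    """Build an `azcopy jobs resume` argument list for each unfinished job."""
    return [
        ["azcopy", "jobs", "resume", job["job_id"]]
        for job in parse_jobs(listing)
        if job["status"] in statuses
    ]
```

You could feed it the captured output of azcopy jobs list and pass each resulting argument list to subprocess.run.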
If you need to get to the bottom of any errors, you can change the default log level (the default is INFO) and filter by jobs with a Failed state. AzCopy creates log and plan files for every job you run in the %USERPROFILE%\.azcopy directory on Windows.
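As a small sketch of how you might locate a job's log and list only its failed transfers from Python: the helper names are mine, and the <job-id>.log naming is an assumption based on what I saw in my own .azcopy directory, so check yours before relying on it. The --with-status flag on azcopy jobs show is the documented way to filter transfers by state.

```python
from pathlib import Path


def azcopy_log_path(job_id, log_dir=None):
    """Return the expected log path for a job (assumes <job-id>.log naming)."""
    # Path.home() resolves to %USERPROFILE% on Windows.
    directory = Path(log_dir) if log_dir else Path.home() / ".azcopy"
    return directory / f"{job_id}.log"


def failed_transfers_command(job_id):
    """Build the command that lists only the failed transfers of a job."""
    return ["azcopy", "jobs", "show", job_id, "--with-status=Failed"]
```

To change how much gets logged in the first place, the copy command accepts a --log-level flag.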
After you finish, you can clean up all your plan and log files by executing azcopy jobs clean (or azcopy jobs rm <job-id> if you want to remove just one).
Microsoft recommends that individual jobs contain no more than 10 million files. Jobs that transfer more than 50 million files can suffer from degraded performance because of the tracking overhead. I didn't need to worry about performance, but I still learned a few valuable things.
To speed things up, you can increase the number of concurrent requests by setting the AZCOPY_CONCURRENCY_VALUE environment variable. By default, AzCopy sets the value to 16 multiplied by the number of CPUs on your machine; if you have fewer than 5 CPUs, the value is 32. Because I have 12 CPUs, AzCopy set AZCOPY_CONCURRENCY_VALUE to 192.
If you'd like to confirm, you can look at the top of your job's log file.
2021/11/19 16:39:20 AzcopyVersion 10.13.0
2021/11/19 16:39:20 OS-Environment windows
2021/11/19 16:39:20 OS-Architecture amd64
2021/11/19 16:39:20 Log times are in UTC. Local time is 19 Nov 2021 10:39:20
2021/11/19 16:39:20 Job-Command copy https:/mystorageaccount.blob.core.windows.net/my-container --recursive=true
2021/11/19 16:39:20 Number of CPUs: 12
2021/11/19 16:39:20 Max file buffer RAM 6.000 GB
2021/11/19 16:39:20 Max concurrent network operations: 192 (Based on number of CPUs. Set AZCOPY_CONCURRENCY_VALUE environment variable to override)
2021/11/19 16:39:20 Check CPU usage when dynamically tuning concurrency: true (Based on hard-coded default. Set AZCOPY_TUNE_TO_CPU environment variable to true or false override)
2021/11/19 16:39:20 Max concurrent transfer initiation routines: 64 (Based on hard-coded default. Set AZCOPY_CONCURRENT_FILES environment variable to override)
2021/11/19 16:39:20 Max enumeration routines: 16 (Based on hard-coded default. Set AZCOPY_CONCURRENT_SCAN environment variable to override)
2021/11/19 16:39:20 Parallelize getting file properties (file.Stat): false (Based on AZCOPY_PARALLEL_STAT_FILES environment variable)
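If you want to check these values across several jobs without opening each log, a short Python sketch like this one can pull them out. The function name is my own, and it only relies on the two log lines shown in the sample above ("Number of CPUs" and "Max concurrent network operations"); other AzCopy versions may word them differently.

```python
import re


def read_log_settings(log_text):
    """Extract the CPU count and concurrency value from an AzCopy log header."""
    cpus = re.search(r"Number of CPUs:\s*(\d+)", log_text)
    concurrency = re.search(
        r"Max concurrent network operations:\s*(\d+)", log_text
    )
    return {
        # None means the line was not found in the text.
        "cpus": int(cpus.group(1)) if cpus else None,
        "concurrency": int(concurrency.group(1)) if concurrency else None,
    }
```

For my log, this returns 12 CPUs and a concurrency of 192, which matches the 16-per-CPU default.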
You can tweak these values to see what works for you. Luckily, AzCopy allows you to run benchmark tests that will report a recommended concurrency value.
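Here is a hedged Python sketch for scripting a benchmark run and then overriding the concurrency value. The helper names are hypothetical, and the file count and size defaults are illustrative choices on my part; azcopy bench, --file-count, --size-per-file, and AZCOPY_CONCURRENCY_VALUE come from the AzCopy docs, but check azcopy bench --help for your version.

```python
import os


def build_bench_command(destination, file_count=100, size_per_file="250M"):
    """Build the argument list for an `azcopy bench` upload benchmark."""
    return [
        "azcopy",
        "bench",
        destination,
        f"--file-count={file_count}",
        f"--size-per-file={size_per_file}",
    ]


def with_concurrency(value, base_env=None):
    """Return a copy of the environment with AZCOPY_CONCURRENCY_VALUE set."""
    env = dict(os.environ if base_env is None else base_env)
    # Environment variables are strings, so convert the number.
    env["AZCOPY_CONCURRENCY_VALUE"] = str(value)
    return env
```

Once the benchmark reports a recommended value, you can pass with_concurrency(that_value) as the env argument to subprocess.run when launching the real copy.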
This was my first time using AzCopy for any serious work, and I had a good experience. It comes with a lot more flexibility than I imagined and even has features for limiting throughput and optimizing memory use.
To get started, see Microsoft's "Get started with AzCopy" documentation, and let me know what you think of it!