Uploading and Downloading Zip Files In GCP Cloud Storage Using Python

#python #googlecloud #cloudstorage #zipfiles

GCP (Google Cloud Platform) cloud storage is the object storage service provided by Google for storing many data formats from PNG files to zipped source code for web apps and cloud functions. The data is stored in a flat, key/value-like data structure where the key is your storage object's name and the value is your data.

Object storage is great for storing massive amounts of data as a single entity, data that will later be accessed all at once as opposed to data that will be read and written in small subsets as is the case with relational and non-relational databases.

If you're looking to store a collection of files as a single unit, either to archive a large number of log files for future audits or to bundle and store code as a part of an automated deployment cycle, it's likely you will do so by packing all of it together as a zip file.

Using an application to automate the process of creating, altering, or unzipping a zip file in memory is a useful skill to have. However, working with memory streams and bytes rather than integers, strings, and objects can be daunting when it is unfamiliar territory.

Whether you are specifically looking to upload and download zip files to GCP cloud storage or you simply have an interest in learning how to work with zip files in memory, this post will walk you through two processes: creating a new zip file from files on your local machine and uploading it to cloud storage, and downloading an existing zip file from cloud storage and unzipping it to a local directory.

Establishing Credentials

Before you can begin uploading and downloading local files to cloud storage as zip files, you will need to create the client object used in your Python code to communicate with your project's cloud storage resources in GCP.

There are various ways to establish credentials that will grant the client object access to a cloud storage bucket, the most common of which is to create a service account and assign it to your application in one of two ways.

The first option is to assign the service account to a particular resource upon deployment. For example, if your code is being deployed as a GCP cloud function, you would attach the service account to the application upon deployment using either the gcloud sdk:

# using powershell and the gcloud sdk to deploy a python cloud function
gcloud functions deploy my-cloud-function `
--entry-point my_function_name `
--runtime python38 `
--service-account my-cloud-function@my-project-id.iam.gserviceaccount.com `
--trigger-http

Or by using an IAC (infrastructure as code) solution like Terraform:

resource "google_service_account" "my_cloud_func_sa" {
    account_id   = "my-cloud-function"
    display_name = "Cloud Function Service Account"
}

resource "google_project_iam_binding" "cloud_storage_user" {
    project = "my-project-id"
    role    = "roles/storage.objectAdmin"
    members = [
        "serviceAccount:${google_service_account.my_cloud_func_sa.email}",
    ]
}

resource "google_cloud_functions_function" "my_cloud_func" {
    name                  = "my-cloud-function"
    entry_point           = "my_function_name"
    runtime               = "python38"
    service_account_email = google_service_account.my_cloud_func_sa.email
    trigger_http          = true
}

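Note that the service account as defined in Terraform is also being referenced in a google_project_iam_binding resource as a member that will be assigned the role of storage.objectAdmin. Be aware that google_project_iam_binding is authoritative for the role it manages, so any other members already holding that role in the project will be removed from it; google_project_iam_member grants the role to a single member without affecting the others. You will need to assign a similar role (or ideally one with the minimal permissions required for your code to perform its tasks) if you choose to create a service account using the GCP console.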

For code being deployed with an assigned service account, creating the GCP cloud storage client in Python requires only the project id be passed as an argument to the client constructor.

from google.cloud import storage

client = storage.Client(
    project=GCP_PROJECT_ID,
)

However, if you would like to upload and download to cloud storage using a CLI application, or to test your cloud function before deploying it, you will want to use a locally stored JSON credentials file.

To create the file, open the GCP console and select IAM & Admin from the Navigation menu, accessed through the hamburger menu icon in the top left corner.

From the IAM & Admin menu, select the Service Accounts page and either create a new service account or click on the link of an existing one, found under the Email column of the service accounts table.

At the bottom of the Details page for the selected service account, click Add Key > Create New Key and select the JSON option.

This will download the JSON credentials file to your machine.

Anyone with access to this file will have the credentials necessary to make changes to your cloud resources according to the permissions of this service account. Store it in a secure place and do not check this file into source control. If you do, immediately delete the key from the same menu used to create it and remove the JSON file from source control.

To allow your client object to use these credentials and access GCP cloud storage, initializing the client will require a few extra steps. You will need to create a credentials object using the from_service_account_file method on the service_account.Credentials class of the google.oauth2 library. The only required argument for this method is the absolute or relative file path to your JSON credentials file.

This credentials object will be passed as a second argument to the storage.Client class constructor.

from google.cloud import storage
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(SERVICE_ACCOUNT_FILE)

client = storage.Client(
    project=GCP_PROJECT_ID,
    credentials=credentials
)
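When several key files end up on one machine, it can be useful to confirm which service account a file belongs to before handing it to from_service_account_file. The following is a minimal sketch using only the standard library. The helper name describe_key_file is hypothetical, and the check assumes the file is a service account key as downloaded from the console, which carries the fields type, project_id, client_email, and private_key.

```python
import json
from pathlib import Path


def describe_key_file(key_path):
    """Return non-secret identifying fields from a service account key file.

    Hypothetical helper for illustration; it is not part of any Google library.
    """
    data = json.loads(Path(key_path).read_text(encoding="utf-8"))

    # A file missing any of these fields is probably not a service account key.
    required = ("type", "project_id", "client_email", "private_key")
    missing = [field for field in required if field not in data]
    if missing:
        raise ValueError(f"{key_path} is missing fields: {', '.join(missing)}")
    if data["type"] != "service_account":
        raise ValueError(f"{key_path} is not a service account key")

    # Deliberately leave the private key out of the returned summary.
    return {
        "project_id": data["project_id"],
        "client_email": data["client_email"],
    }
```

The returned project_id could then be passed as the project argument to storage.Client alongside the credentials object, which keeps the project id and the key file from drifting apart.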

Uploading Local Files to Cloud Storage as a Zip File

Now that your client object has the required permissions to access cloud storage you can begin uploading local files as a zip file.

Assuming that the files you intend to upload are all in the same directory and are not already zipped, you will upload the files to GCP cloud storage as a zip file by creating a zip archive in memory and uploading it as bytes.

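import io
import pathlib

from google.cloud import storage
from zipfile import ZipFile, ZipInfo

def upload():
    # SOURCE_DIRECTORY, BUCKET_NAME, and client are defined elsewhere
    source_dir = pathlib.Path(SOURCE_DIRECTORY)

    # build the zip archive in memory rather than on disk
    archive = io.BytesIO()
    with ZipFile(archive, 'w') as zip_archive:
        for file_path in source_dir.iterdir():
            if not file_path.is_file():
                continue  # skip subdirectories
            # read as bytes so binary files (e.g. PNGs) are not corrupted
            with open(file_path, 'rb') as file:
                zip_entry_name = file_path.name
                zip_file = ZipInfo(zip_entry_name)
                zip_archive.writestr(zip_file, file.read())

    # move the stream position back to the first byte before uploading
    archive.seek(0)

    object_name = 'super-important-data-v1'
    bucket = client.bucket(BUCKET_NAME)

    blob = storage.Blob(object_name, bucket)
    blob.upload_from_file(archive, content_type='application/zip')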

io.BytesIO() creates an in memory binary stream used by the ZipFile object to store all the data from your local files as bytes.

The files in the source directory are iterated over and for each one a ZipInfo object is created and written to the ZipFile object along with the contents of the source file. The ZipInfo object corresponds to an individual file entry within a zip file and will be labeled with whatever file name and extension you use in the constructor to instantiate the ZipInfo object. Using zip_entry_name = file_path.name as in the example above will set the file name and extension in the zip file to match the name and extension of the local file.
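One detail worth knowing about the approach described above: a ZipInfo created with only a file name defaults to ZIP_STORED (no compression) and a 1980-01-01 timestamp, so the entries are bundled but not compressed. The following is a minimal alternative sketch, not the method used in this post, that lets ZipFile.write read each file itself and compresses with ZIP_DEFLATED. The function name zip_directory_to_bytes and the sample file names are made up for illustration.

```python
import io
import pathlib
import tempfile
from zipfile import ZIP_DEFLATED, ZipFile


def zip_directory_to_bytes(source_dir):
    """Zip the regular files directly inside source_dir and return the bytes."""
    archive = io.BytesIO()
    with ZipFile(archive, "w", compression=ZIP_DEFLATED) as zip_archive:
        for file_path in sorted(pathlib.Path(source_dir).iterdir()):
            if file_path.is_file():
                # arcname sets the entry name stored inside the archive
                zip_archive.write(file_path, arcname=file_path.name)
    return archive.getvalue()


# Usage with a throwaway directory holding two files and one subdirectory
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = pathlib.Path(tmp)
    (tmp_path / "a.log").write_text("first log\n")
    (tmp_path / "b.log").write_text("second log\n")
    (tmp_path / "nested").mkdir()
    zip_bytes = zip_directory_to_bytes(tmp_path)

with ZipFile(io.BytesIO(zip_bytes)) as check:
    entry_names = check.namelist()  # ['a.log', 'b.log']
```

Because ZipFile.write takes the entry metadata from the file on disk, the original modification times are preserved as well, which may or may not matter for your use case.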

The in memory binary stream (the archive variable) is what you will be uploading to GCP cloud storage. However, a prerequisite for uploading an in memory stream is that the stream position be set to the start of the stream. Once the zip file has been written, the position sits at the end of the stream, so without moving it back to zero with archive.seek(0) you will get an error when you try to upload the data.
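The effect of the stream position is easy to see without touching cloud storage at all. This small standard-library sketch writes a one-entry archive (the entry name hello.txt is arbitrary) and reads the stream before and after rewinding it.

```python
import io
from zipfile import ZipFile

archive = io.BytesIO()
with ZipFile(archive, "w") as zip_archive:
    zip_archive.writestr("hello.txt", "hello")

# After the archive is closed, the position is at the end of the stream.
position_after_write = archive.tell()
read_without_seek = archive.read()  # nothing left to read from here

# Rewinding makes the whole archive readable again.
archive.seek(0)
read_after_seek = archive.read()
```

Here read_without_seek is empty while read_after_seek holds the complete archive. As far as I know, upload_from_file also accepts a rewind=True argument that performs the seek for you, though calling archive.seek(0) explicitly keeps the intent visible.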

With the in memory binary stream ready to be delivered, the remaining lines of code create a new Bucket object for the specified bucket and a Blob object for the storage object. The zipped files are then uploaded to cloud storage and can later be retrieved using the storage object name you used to create the Blob instance.

A bucket in cloud storage is a user defined partition for the logical separation of data and a blob (as the Python class is called) is another name for a storage object.

Downloading a Zip File Blob in Cloud Storage to a Local Directory

To download a zip file storage object and unzip it into a local directory, you will need to reverse the process by first creating a bucket object and a blob object in order to download the zip file as bytes.

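def download():
    # TARGET_DIRECTORY, BUCKET_NAME, and client are defined elsewhere
    target_dir = pathlib.Path(TARGET_DIRECTORY)

    object_name = 'super-important-data-v1'
    bucket = client.bucket(BUCKET_NAME)

    blob = storage.Blob(object_name, bucket)
    object_bytes = blob.download_as_bytes()

    # copy the downloaded bytes into an in memory stream
    archive = io.BytesIO()
    archive.write(object_bytes)

    # open in read mode; 'w' would start a new, empty archive
    with ZipFile(archive, 'r') as zip_archive:
        zip_archive.extractall(target_dir)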

Once downloaded, the bytes can be written to an in memory stream which will in turn be used to create a ZipFile object in order to extract the files to your target directory. io.BytesIO() is again used to create the in memory binary stream and the write method on the BytesIO object is used to write the downloaded bytes to the stream. The ZipFile object has a method for extracting all of its contents to a specified directory, making the final step a simple one.
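The unzip step can be tried on its own, without a bucket, by feeding it bytes produced locally. The sketch below is a standard-library illustration under that assumption: unzip_bytes is a hypothetical helper name, and the bytes are built in memory where the download function would get them from the blob. It passes the bytes straight to the BytesIO constructor, which is equivalent to creating an empty stream and calling write.

```python
import io
import pathlib
import tempfile
from zipfile import ZipFile


def unzip_bytes(zip_bytes, target_dir):
    """Extract a zip archive held in memory into target_dir."""
    target = pathlib.Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # 'r' opens the existing archive for reading
    with ZipFile(io.BytesIO(zip_bytes), "r") as zip_archive:
        zip_archive.extractall(target)
        return zip_archive.namelist()


# Build a small archive in memory to stand in for downloaded bytes
source = io.BytesIO()
with ZipFile(source, "w") as zip_archive:
    zip_archive.writestr("notes.txt", "remember to rotate keys")

with tempfile.TemporaryDirectory() as tmp:
    extracted_names = unzip_bytes(source.getvalue(), tmp)
    extracted_text = (pathlib.Path(tmp) / "notes.txt").read_text()
```

Note that extractall trusts the archive's contents, so it is best reserved for zip files you created yourself or otherwise have reason to trust.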

With these two functions and the appropriate credentials you should have everything you need to start uploading and downloading your own zip files into cloud storage using Python.

And if you'd like to see all the Python code in one place, you can find it here as a Gist on my GitHub account.