
David Haley


First impressions: GPU + GCP Batch

Weihao and I have been working on programmatic benchmarks for DeepCell on Google Batch.

We tried Vertex AI custom training jobs but ran into an issue with service accounts. The training job appeared to run as the expected(?) service account, but in an unexpected project, and we didn't track down how to give that project's user access to BigQuery. We also figured that we may want to run the container a little closer to the metal (not a VM though).

Enter Google Batch … I've used Batch-like products but never with a GPU. Initial work often looks like a lot of red failures 🥲

Screenshot of the Batch jobs list. Mostly failures, some successes.

First impressions:

1: BigQuery rate limit

I forgot BigQuery has a fairly low rate limit on table updates (5 operations per 10 seconds). So a batch of 10 jobs finishing too close together would overwhelm the table update. Quick fix with retry logic.
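For illustration, here is a minimal sketch of what that kind of retry logic could look like: exponential backoff with jitter, so that jobs finishing at the same moment don't all retry at the same moment too. This is not our actual code. `RateLimitError` and `retry_with_backoff` are hypothetical names, and in practice you would catch whatever exception your BigQuery client raises when the rate limit is hit.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error a client raises on a rate limit."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying on RateLimitError with jittered backoff.

    The delay cap doubles each attempt (base_delay, 2x, 4x, ...) up to
    max_delay; the actual sleep is a random value between 0 and that cap.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

Usage would be something like `retry_with_backoff(lambda: update_results_table(row))`, where `update_results_table` is a placeholder for whatever function performs the table update. The random jitter matters here because the problem is precisely several jobs finishing at once.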

2: GPU scarcity

We've had bad luck getting GPUs. The zone reports exhausted resource pools on the regular:

Screenshot of an error message showing that the GCE resource pool is exhausted for the zone.

We ran into a surprising quota issue as well, running out of persistent disk SSDs – even though we weren't using any…

Screenshot of error message showing inadequate SSD quota: limit 500, usage 480, wanted 30.

The quota page showed the usage going up and down (again, we never observed any disks in the GCE console):

Screenshot of the quota visualization.

You can (kinda) see it trying different availability zones within region us-central1 here:

Screenshot of the monitoring metric for allocated quota, showing several availability zones summing up to an overall regional usage.

We tried increasing the quota to 1 TB (from 500 GB). No luck so far: no resources…!

The quota goes up in increments of 30 GB, one per zone resource exhaustion error. I'm guessing it's a Batch implementation detail to spin up the disks in anticipation of having a VM ready.
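The numbers in the quota error line up with that guess. Here is a quick back-of-the-envelope check, assuming (and this is only an assumption) that every failed zone attempt holds on to 30 GB and nothing is released in between. The function names are made up for this sketch.

```python
def attempts_before_exhaustion(limit_gb, per_attempt_gb=30, usage_gb=0):
    """How many 30 GB reservations fit in the remaining quota."""
    return (limit_gb - usage_gb) // per_attempt_gb


def would_exceed(limit_gb, usage_gb, wanted_gb):
    """True if one more reservation would push usage past the limit."""
    return usage_gb + wanted_gb > limit_gb


# With the original 500 GB limit, 16 attempts fit, which is 480 GB of usage.
attempts = attempts_before_exhaustion(500)
usage = attempts * 30

# The next 30 GB request would need 510 GB, so it is rejected.
blocked = would_exceed(500, usage, 30)
```

That matches the error message above (limit 500, usage 480, wanted 30): sixteen reservations of 30 GB fit, and the seventeenth does not. Since the quota page showed usage going down as well as up, the real behavior is clearly less tidy than this.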

Fortunately…! there is no billing charge for these disks. It's nice that it only bills when it actually runs, although it's odd to use up the quota.

I've heard several reports of people using GPUs on Batch, but it's clear that the incantations are arcane indeed. If you know how to reliably get GPUs or have worked through these errors, please let me know!
