At DevFest Montreal 2024, I presented a talk on scaling GPU workloads using Google Kubernetes Engine (GKE), focusing on the complexities of load-based scaling. While GKE provided robust solutions for managing GPU workloads, we still faced the challenge of ongoing infrastructure costs, especially during periods of low utilization. Google's recent launch of GPU support in Cloud Run marks an exciting development in serverless computing, potentially addressing these scaling and cost challenges by offering GPU capabilities in a true serverless environment.
Cloud Run GPU: The Offering
Cloud Run is Google Cloud's serverless compute platform that allows developers to run containerized applications without managing the underlying infrastructure. The serverless model offers significant advantages:
- Automatic scaling (including scaling to zero when there's no traffic)
- Pay-per-use billing
- Zero infrastructure management
However, it also comes with trade-offs, such as cold starts when scaling up from zero and maximum execution time limits.
The recent addition of GPU support to Cloud Run opens new possibilities for compute-intensive workloads in a serverless environment. This feature provides access to NVIDIA L4 GPUs, which are particularly well-suited for:
- AI inference workloads
- Video processing
- 3D rendering
The L4 GPU, built on NVIDIA's Ada Lovelace architecture, offers 24GB of GPU memory (VRAM) and supports key AI frameworks and CUDA applications. These GPUs provide a sweet spot between performance and cost, especially for inference workloads and graphics processing.
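To sanity-check that the GPU is actually visible from inside the container, a quick probe at startup helps. Here's a minimal sketch assuming PyTorch is installed in the image (the framework choice is my assumption, not a Cloud Run requirement):

```python
# gpu_check.py - startup sanity check; assumes PyTorch is in the container image.
import torch

if torch.cuda.is_available():
    # On a GPU-enabled Cloud Run service this should report the NVIDIA L4
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"VRAM: {vram_gb:.1f} GB")
else:
    print("No GPU visible; falling back to CPU")
```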
Understanding Cold Starts and Test Results
Having worked with serverless infrastructure for nearly a decade, I've encountered numerous challenges with cold starts across different platforms. With Cloud Run's new GPU feature, I was particularly interested in understanding the cold start behavior and its implications for real-world applications.
To investigate this, I designed an experiment to measure response times after different idle periods. The experiment consisted of running burst tests of 5 consecutive API calls to a GPU-enabled Cloud Run service at different intervals (5, 10, and 20 minutes). Each test was repeated multiple times to ensure consistency. The service performed a standardized 3D rendering workload, a task well suited to GPU acceleration.
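For reference, here's a minimal sketch of how such a burst test can be scripted; the service URL and endpoint are hypothetical placeholders, and the timing logic mirrors the procedure described above:

```python
# burst_test.py - sketch of the cold-start measurement procedure.
import time

import requests

SERVICE_URL = "https://my-gpu-service-xyz.a.run.app/render"  # hypothetical placeholder
BURST_SIZE = 5        # consecutive API calls per burst
IDLE_MINUTES = 10     # idle period between bursts (5, 10, or 20 in the tests)

def run_burst():
    """Fire BURST_SIZE sequential requests and record each latency in ms."""
    latencies = []
    for _ in range(BURST_SIZE):
        start = time.monotonic()
        # Generous timeout: a full cold start can exceed 100 seconds
        resp = requests.get(SERVICE_URL, timeout=180)
        resp.raise_for_status()
        latencies.append((time.monotonic() - start) * 1000)
    return latencies

for round_num in range(1, 4):  # repeat each interval to check consistency
    print(f"Round {round_num}: " +
          ", ".join(f"{ms:,.0f} ms" for ms in run_burst()))
    time.sleep(IDLE_MINUTES * 60)  # let the instance idle before the next burst
```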
Our testing revealed three distinct patterns:
- Full Cold Start (~105-120 seconds): When no instance has been active for an extended idle period (consistently at 20 minutes, intermittently at 10)
- Warm Start (~6-7 seconds): When a retained instance restarts, reliably within 5 minutes of the last request
- Hot Start (~1.5 seconds): Subsequent requests while an instance is already serving
Here's a summary of our findings:
| Interval | First Request (ms) | Subsequent Requests (ms) | Instance State |
|---|---|---|---|
| 5 minutes | 6,800-7,000 | 1,400-1,800 | Warm Start |
| 10 minutes | 105,000-107,000 | 1,400-1,700 | Full Cold Start |
| 10 minutes | 6,800-7,200 | 1,400-1,700 | Warm Start |
| 20 minutes | 105,000-120,000 | 1,400-1,800 | Full Cold Start |
Cloud Run's GPU support introduces an exciting option for organizations looking to optimize their GPU workloads without maintaining constant infrastructure. Our testing revealed interesting behavior at the 10-minute interval mark, where the instance sometimes remained warm (~7 seconds startup) and sometimes required a full cold start (~105-107 seconds). This variability suggests that Cloud Run's instance retention behavior isn't strictly time-based and might depend on other factors such as system load and resource availability.
While these cold start characteristics make it unsuitable for real-time applications requiring consistent sub-second response times, Cloud Run GPU excels in several scenarios:
Best suited for:
- Batch processing workloads
- Development and testing environments
- Asynchronous processing systems
- Scheduled jobs where startup time isn't critical
Not recommended for:
- Real-time user-facing applications
- Applications requiring consistent sub-second response times
- Continuous high-throughput workloads
For teams working with periodic GPU workloads - whether it's scheduled rendering jobs, ML model inference, or development testing - Cloud Run GPU offers a compelling balance of performance and cost-effectiveness, especially when compared to maintaining always-on GPU infrastructure. Understanding these warm/cold start patterns is crucial for architecting solutions that can effectively leverage this serverless GPU capability.
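As one concrete way to design around these patterns, a caller can simply budget for the worst-case cold start instead of treating it as a failure. The sketch below is a minimal example; the 150-second timeout and retry policy are my assumptions derived from the measurements above, and the URL is a hypothetical placeholder:

```python
# resilient_client.py - cold-start-tolerant caller for a GPU-backed service.
import time

import requests

SERVICE_URL = "https://my-gpu-service-xyz.a.run.app/render"  # hypothetical placeholder

def call_with_cold_start_budget(payload, max_attempts=2):
    """Allow the first attempt enough time to absorb a full cold start."""
    for attempt in range(1, max_attempts + 1):
        try:
            # Worst case observed above was ~120s, so 150s leaves headroom
            resp = requests.post(SERVICE_URL, json=payload, timeout=150)
            resp.raise_for_status()
            return resp.json()
        except requests.exceptions.RequestException:
            if attempt == max_attempts:
                raise
            time.sleep(5)  # brief pause; a retry will likely hit a warm instance

# Usage: result = call_with_cold_start_budget({"scene": "demo"})
```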
The key to success with Cloud Run GPU is matching your workload patterns to the platform's characteristics. For workloads that can tolerate occasional cold starts, the cost savings and zero-maintenance benefits make it an attractive option in the GPU computing landscape.