Google Cloud Run Integrates NVIDIA L4 GPUs for Enhanced AI Inference Deployments

Google Cloud Run has announced the integration of NVIDIA L4 Tensor Core GPUs, NVIDIA NIM microservices, and capabilities for serverless AI inference deployments, according to the NVIDIA Technical Blog. This collaboration aims to address the challenges enterprises face when deploying AI-enabled applications, including performance optimization, scalability, and infrastructure complexity.

Enhancing AI Inference Deployments

Google Cloud’s fully managed serverless container runtime, Cloud Run, now supports NVIDIA L4 Tensor Core GPUs in preview. This allows enterprises to run real-time AI applications on demand without the hassle of managing infrastructure. The integration of NVIDIA NIM microservices further simplifies the optimization and deployment of AI models, maximizing application performance and reducing complexity.

Real-Time AI-Enabled Applications

Cloud Run abstracts infrastructure management by dynamically allocating resources based on incoming traffic, ensuring efficient scaling and resource utilization. The support for NVIDIA L4 GPUs represents a significant upgrade from previous CPU-only offerings, providing up to 120x higher AI video performance over CPU solutions and 2.7x more generative AI inference performance over the previous generation.

Notably, companies like Let’s Enhance, Wombo, Writer, Descript, and AppLovin are leveraging NVIDIA L4 GPUs to power their generative AI applications, delivering enhanced user experiences.

Performance-Optimized Serverless AI Inference

Optimizing AI model performance is crucial for resource efficiency and cost management. NVIDIA NIM offers a set of optimized cloud-native microservices that simplify and accelerate AI model deployment. These pre-optimized, containerized models integrate seamlessly into applications, reducing development time and maximizing resource efficiency.

NVIDIA NIM on Cloud Run allows for the deployment of high-performance AI applications using optimized inference engines that unlock the full potential of NVIDIA L4 GPUs, providing superior throughput and latency without requiring specialized expertise in inference performance optimization.

Deploying Llama3-8B-Instruct NIM Microservice

Deploying models like Llama3-8B-Instruct with Cloud Run on NVIDIA L4 GPUs is straightforward. Users need to install the Google Cloud SDK and follow a series of steps to clone the repository, set environment variables, edit the Dockerfile, build the container, and deploy it using provided scripts.

Getting Started

The integration of the NVIDIA AI platform, including NVIDIA NIM and NVIDIA L4 GPUs, with Google Cloud Run addresses key challenges in AI application deployment. This synergy accelerates deployment, boosts performance, and ensures operational efficiency and cost-effectiveness.

Developers can prototype with NVIDIA NIM microservices through the NVIDIA API catalog, then download NIM containers for further development on Google Cloud Run. For enterprise-grade security and support, a 90-day NVIDIA AI Enterprise license is available.

Currently, Cloud Run with NVIDIA L4 GPU support is in preview in the us-central1 Google Cloud region. More information and demos are available at the launch event livestream and sign-up page.