What is ZeroGPU
- Hugging Face Spaces now offers a new hardware option called ZeroGPU.
- ZeroGPU uses Nvidia A100 GPU devices under the hood, and 40GB of VRAM is available to each workload.
- This is achieved by making Spaces efficiently hold and release GPUs as needed, as opposed to a classical GPU Space that holds exactly one GPU at any given time.
- You can explore and use existing public ZeroGPU Spaces for free. A list of public ZeroGPU Spaces can be found at https://huggingface.co/spaces/enzostvs/zero-gpu-spaces
Hosting models on ZeroGPU Spaces comes with the following restrictions:
- ZeroGPU is currently in beta and only works with the Gradio SDK. It lets us deploy Gradio applications that run on ZeroGPU for free.
- It is only available to personal accounts subscribed to Hugging Face PRO, and it appears in the hardware list when you select the Gradio SDK.
- Personal accounts with a PRO subscription can host at most 10 ZeroGPU Spaces.
- Though the documentation mentions that ZeroGPU uses an A100, you may observe significantly lower performance than a standalone A100, as GPU allocation may be time-sliced.
Spaces and ZeroGPU
To make your Space work with ZeroGPU, you need to decorate the Python functions that require a GPU with @spaces.GPU
import spaces

@spaces.GPU
def my_inference_function(input_data, output_data, mode, max_length, max_new_tokens, model_size):
    # A GPU is attached to the Space for the duration of this call
    ...
When a decorated function is invoked, the Space will be attributed a GPU, and it will release it upon completion of the function. You can find complete instructions for making your code compatible with ZeroGPU Spaces at ZeroGPU Explorers on Hugging Face.
If your Space is running on ZeroGPU, you can see the status on the Space's project page, along with the CPU and RAM consumption.
The software versions supported by Gradio SDK ZeroGPU Spaces are:
- Gradio: 4+
- PyTorch: 2.0.0 to 2.2.0
- Python: 3.10.13
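On Spaces, the Gradio and Python versions are declared in the YAML front matter of the Space's README.md, while PyTorch is pinned in requirements.txt. A minimal sketch consistent with the constraints above (the specific sdk_version shown is an illustrative assumption):

---
sdk: gradio
sdk_version: 4.26.0
python_version: "3.10"
---

with requirements.txt containing, for example:

torch==2.2.0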
Duration
A duration parameter in the spaces.GPU decorator allows us to specify the GPU allocation time.
- The default is 60 seconds.
- If you expect your GPU function to take more than 60 seconds, you need to specify a longer duration.
- If you know that your function will take far less than the 60-second default, specifying a shorter duration gives visitors to your Space higher priority in the queue (see the second example below).
@spaces.GPU(duration=120)
def generate(prompt):
    return pipe(prompt).images
This sets the maximum duration of the function call to 120 seconds.
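Conversely, a function that is known to finish quickly can declare a shorter duration to get better queue priority. A minimal sketch, where the 20-second figure and the classifier pipeline are illustrative assumptions:

@spaces.GPU(duration=20)
def classify(text):
    # A lightweight call that comfortably completes within 20 seconds
    return classifier(text)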
Hosting Private ZeroGPU Spaces
- We will create a Space pi19404/ai-worker, which is a ZeroGPU Space. The visibility of this Space is private.
- The Gradio server hosted on pi19404/ai-worker provides a ShieldGemma2 model inference endpoint. For more details on ShieldGemma2, refer to the article LLM Content Safety Evaluation using ShieldGemma.
- We can create a Space pi19404/shieldgemma-demo that programmatically calls the application hosted on pi19404/ai-worker. The visibility of this Space is public.
- We configure the Hugging Face token as a secret named API_TOKEN in the project settings of pi19404/shieldgemma-demo.
- We can call the Gradio server API using the Gradio client as described below.
import os
from gradio_client import Client

# Read the Hugging Face access token configured as a Space secret
API_TOKEN = os.getenv("API_TOKEN")

# Initialize the Gradio Client
# This connects to the private ZeroGPU Hugging Face Space "pi19404/ai-worker"
client = Client("pi19404/ai-worker", hf_token=API_TOKEN)
# Make a prediction using the client
# The predict method calls the specified API endpoint with the given parameters
result = client.predict(
    # Input parameters for the my_inference_function API
    input_data="Hello!!",    # The input text to be evaluated
    output_data="Hello!!",   # The output text to be evaluated (if applicable)
    mode="scoring",          # The mode of operation: "scoring" or "generative"
    max_length=150,          # Maximum length of the input prompt
    max_new_tokens=1024,     # Maximum number of new tokens to generate
    model_size="2B",         # Size of the model to use: "2B", "9B", or "27B"
    api_name="/my_inference_function"  # The specific API endpoint to call
)
# Print the result of the prediction
print(result)
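If you are unsure of the endpoint name or its parameters, the Gradio client can print the API schema that a Space exposes, which is where /my_inference_function and its arguments come from:

# Inspect the endpoints and parameters exposed by the worker Space
client.view_api()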
Explaining Rate Limits for ZeroGPU
The Hugging Face platform rate-limits ZeroGPU Spaces to ensure that a single user does not hog all available GPUs. The limit is enforced through a special token that the Hugging Face Hub infrastructure adds to all incoming requests to Spaces. This token is a request header called X-IP-Token, and its value changes depending on the user who requests the ZeroGPU Space.
With the Python client, you will quickly exhaust your rate limit, as all requests to the ZeroGPU Space carry the same token. To avoid this, we need to extract the user's token in Space pi19404/shieldgemma-demo before calling Space pi19404/ai-worker programmatically.
When a new user visits the page:
- We use the load event to extract the user's x-ip-token header.
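The set_client_for_session handler wired to the load event below is not shown in the original snippet; a minimal sketch, assuming Gradio injects the incoming request into any handler that declares a gr.Request parameter, could look like this:

import gradio as gr

def set_client_for_session(request: gr.Request):
    # ZeroGPU's per-user quota token arrives as a request header
    x_ip_token = request.headers.get("x-ip-token")
    # Delegate to the cached client factory shown further below
    return get_client_for_ip(request.client.host, x_ip_token)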
with gr.Blocks() as demo:
    """
    Main Gradio interface setup. This block sets up the Gradio interface, including:
    - A State component to store the client for the session.
    - A JSON component to display request headers for debugging.
    - Other UI components (not shown in this snippet).
    - A load event that calls set_client_for_session when the interface is loaded.
    """
    gr.Markdown("## LLM Safety Evaluation")
    client = gr.State()
    with gr.Tab("ShieldGemma2"):
        input_text = gr.Textbox(label="Input Text")
        output_text = gr.Textbox(
            label="Response Text",
            lines=5,
            max_lines=10,
            show_copy_button=True,
            elem_classes=["wrap-text"]
        )
        mode_input = gr.Dropdown(choices=["scoring", "generative"], label="Prediction Mode")
        max_length_input = gr.Number(label="Max Length", value=150)
        max_new_tokens_input = gr.Number(label="Max New Tokens", value=1024)
        model_size_input = gr.Dropdown(choices=["2B", "9B", "27B"], label="Model Size")
        response_text = gr.Textbox(
            label="Output Text",
            lines=10,
            max_lines=20,
            show_copy_button=True,
            elem_classes=["wrap-text"]
        )
        text_button = gr.Button("Submit")
        text_button.click(
            fn=my_inference_function,
            inputs=[client, input_text, output_text, mode_input,
                    max_length_input, max_new_tokens_input, model_size_input],
            outputs=response_text
        )
    demo.load(set_client_for_session, None, client)

demo.launch(share=True)
- We create a new Gradio client with this header passed to the headers parameter.
from collections import OrderedDict

# Create an OrderedDict to store clients, limited to 15 entries
client_cache = OrderedDict()
MAX_CACHE_SIZE = 15

default_client = Client("pi19404/ai-worker", hf_token=API_TOKEN)

def get_client_for_ip(ip_address, x_ip_token):
    """
    Retrieve or create a client for the given IP address.

    This function implements a caching mechanism that stores up to
    MAX_CACHE_SIZE clients. If a client for the given token exists in the
    cache, it is returned and moved to the end of the cache (marking it
    as most recently used). If not, a new client is created, added to the
    cache, and the least recently used client is removed if the cache is full.

    Args:
        ip_address (str): The IP address of the client.
        x_ip_token (str): The X-IP-Token header value for the client.

    Returns:
        Client: A Gradio client instance for the given IP address.
    """
    # Fall back to the IP address as the cache key if no token was sent
    if x_ip_token is None:
        x_ip_token = ip_address
    if x_ip_token is None:
        new_client = default_client
    else:
        if x_ip_token in client_cache:
            # Move the accessed item to the end (most recently used)
            client_cache.move_to_end(x_ip_token)
            return client_cache[x_ip_token]
        # Create a new client that forwards the user's quota token
        new_client = Client("pi19404/ai-worker", hf_token=API_TOKEN,
                            headers={"X-IP-Token": x_ip_token})
    # Add to cache, removing the oldest entry if necessary
    if len(client_cache) >= MAX_CACHE_SIZE:
        client_cache.popitem(last=False)
    client_cache[x_ip_token] = new_client
    return new_client
- This ensures that all subsequent predictions pass this header to the ZeroGPU Space.
- The client is saved in a State variable so that it is independent of other users, and it is deleted automatically when the user exits the page.
- We also save the Gradio client in the in-memory client_cache shown above, so that we do not need to create a new client when a user loads the page again with the same token or IP address.
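The my_inference_function wired to the Submit button in the Blocks snippet is the demo Space's client-side wrapper, which the original snippets do not show; a rough sketch that forwards the call through the per-user client might be:

def my_inference_function(client, input_data, output_data, mode,
                          max_length, max_new_tokens, model_size):
    # Forward the request to the private worker Space through the
    # per-user client created by set_client_for_session
    return client.predict(
        input_data=input_data,
        output_data=output_data,
        mode=mode,
        max_length=max_length,
        max_new_tokens=max_new_tokens,
        model_size=model_size,
        api_name="/my_inference_function",
    )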
You can find the full Gradio client code at ShieldGemma Demo - a Hugging Face Space by pi19404.
Public Gradio Interface and Code
You can find the link to the Gradio interface at Shieldgemma Demo - a Hugging Face Space by pi19404.