What is ComfyUI?
ComfyUI is a powerful and flexible user interface for Stable Diffusion that lets users build complex image generation workflows through a node-based system. While ComfyUI comes with a variety of built-in nodes, its true strength lies in its extensibility. Custom nodes let users add new functionality, integrate external services, and tailor ComfyUI to their specific needs.
In this blog post, we will walk through the process of creating a custom node for image captioning using ComfyUI. This node will take an image as input and return a generated caption using an external API.
We will use the Google Gemini API to generate the caption for an image.
Here is the complete code for the image captioning node. You can copy it into any file under the custom_nodes folder in ComfyUI; I have named mine gemini-caption.py.
Complete code for Generating Image Captions
import numpy as np
from PIL import Image
import requests
import io
import base64


class ImageCaptioningNode:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {"image": ("IMAGE",), "api_key": ("STRING", {"default": ""})}
        }

    RETURN_TYPES = ("STRING",)
    FUNCTION = "caption_image"
    CATEGORY = "image"
    OUTPUT_NODE = True

    def caption_image(self, image, api_key):
        # Convert the image tensor to a PIL Image
        image = Image.fromarray(
            np.clip(255.0 * image.cpu().numpy().squeeze(), 0, 255).astype(np.uint8)
        )
        # Convert the image to base64
        buffered = io.BytesIO()
        image.save(buffered, format="PNG")
        img_str = base64.b64encode(buffered.getvalue()).decode()
        api_url = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key={api_key}"
        payload = {
            "contents": [
                {
                    "parts": [
                        {
                            "text": "Generate a caption for this image in as detail as possible. Don't send anything else apart from the caption."
                        },
                        {"inline_data": {"mime_type": "image/png", "data": img_str}},
                    ]
                }
            ]
        }
        # Send the request to the Gemini API
        try:
            response = requests.post(api_url, json=payload)
            response.raise_for_status()
            caption = response.json()["candidates"][0]["content"]["parts"][0]["text"]
        except requests.exceptions.RequestException as e:
            caption = f"Error: Unable to generate caption. {str(e)}"
        print(caption)
        return (caption,)


NODE_CLASS_MAPPINGS = {"ImageCaptioningNode": ImageCaptioningNode}
Here is how the node looks on the UI:
Let's go over it piece by piece to understand how you can create a similar node for your own use case. First of all, whatever you want your node to do, write it as a function (a method on the node class) so that ComfyUI can call it, as I did here with my caption_image function.
Import the necessary libraries
import numpy as np
from PIL import Image
import requests
import io
import base64
These lines import the necessary libraries for my Image Captioning node:
- numpy for numerical operations
- PIL (Python Imaging Library) for image processing
- requests for making HTTP requests to the Gemini API
- io for handling byte streams
- base64 for encoding the image
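To see how io and base64 work together, here is a minimal sketch that uses only the standard library. It writes a few stand-in bytes instead of a real PNG (an assumption for illustration; in the node the bytes come from PIL's image.save), and the helper name encode_buffer is hypothetical.

```python
import base64
import io


def encode_buffer(buffer: io.BytesIO) -> str:
    """Return the buffer's bytes as a base64 ASCII string, safe to embed in JSON."""
    return base64.b64encode(buffer.getvalue()).decode("ascii")


# Stand-in bytes; in the node, image.save(buffer, format="PNG") fills the buffer.
buffer = io.BytesIO()
buffer.write(b"\x89PNG fake image bytes")

encoded = encode_buffer(buffer)

# Decoding gives back exactly the bytes that were written.
assert base64.b64decode(encoded) == buffer.getvalue()
```

JSON cannot carry raw bytes, which is why the image is turned into a base64 string before it goes into the request payload.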
Defining the class for your ComfyUI node
class ImageCaptioningNode:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {"image": ("IMAGE",), "api_key": ("STRING", {"default": ""})}
        }
In my case, I have named it ImageCaptioningNode, as it does what it says.
The INPUT_TYPES class method defines the input types for our node:
- An "image" input of type "IMAGE"
- An "api_key" input of type "STRING" with an empty default value, needed for sending requests to the Gemini API.
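INPUT_TYPES can also declare an "optional" group next to "required", and STRING inputs accept options such as "multiline". As a sketch, here is a hypothetical variant of the node that exposes the caption prompt as an editable field. The class name CaptionNodeWithPrompt and the prompt input are assumptions for illustration and are not part of the node built in this post.

```python
class CaptionNodeWithPrompt:
    """Hypothetical variant of the node; only the input declaration is shown."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "image": ("IMAGE",),
                "api_key": ("STRING", {"default": ""}),
            },
            "optional": {
                # "multiline" asks the UI for a larger text box.
                "prompt": (
                    "STRING",
                    {
                        "default": "Generate a detailed caption for this image.",
                        "multiline": True,
                    },
                ),
            },
        }


inputs = CaptionNodeWithPrompt.INPUT_TYPES()
```

If you add an optional input like this, it is safest to give the matching parameter a default value in the method signature, so the method still works when the input is not supplied.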
    RETURN_TYPES = ("STRING",)
    FUNCTION = "caption_image"
    CATEGORY = "image"
    OUTPUT_NODE = True
These class variables define:
- The return type (a string)
- The main function to be called ("caption_image")
- The category in which the node will appear in ComfyUI
- That this node can be an output node
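To make the role of FUNCTION and RETURN_TYPES more concrete, here is a simplified illustration of how a host application could use them. This is not ComfyUI's actual execution code; run_node and EchoNode are hypothetical names used only for this sketch.

```python
class EchoNode:
    """Toy node following the same class-variable conventions."""

    RETURN_TYPES = ("STRING",)
    FUNCTION = "echo"
    CATEGORY = "utils"

    def echo(self, text):
        # Nodes return a tuple with one entry per RETURN_TYPES item.
        return (text.upper(),)


def run_node(node_class, **inputs):
    """Look up the method named by FUNCTION and call it with the inputs."""
    node = node_class()
    method = getattr(node, node_class.FUNCTION)
    outputs = method(**inputs)
    if len(outputs) != len(node_class.RETURN_TYPES):
        raise ValueError("output count does not match RETURN_TYPES")
    return outputs


result = run_node(EchoNode, text="hello")
```

This is why the string in FUNCTION must match the method name exactly, and why the method returns a tuple even when there is only one output.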
    def caption_image(self, image, api_key):
        # Convert the image tensor to a PIL Image
        image = Image.fromarray(
            np.clip(255.0 * image.cpu().numpy().squeeze(), 0, 255).astype(np.uint8)
        )
        # Convert the image to base64
        buffered = io.BytesIO()
        image.save(buffered, format="PNG")
        img_str = base64.b64encode(buffered.getvalue()).decode()
        api_url = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key={api_key}"
        # Prepare the request payload
        payload = {
            "contents": [
                {
                    "parts": [
                        {
                            "text": "Generate a caption for this image in as detail as possible. Don't send anything else apart from the caption."
                        },
                        {"inline_data": {"mime_type": "image/png", "data": img_str}},
                    ]
                }
            ]
        }
        try:
            response = requests.post(api_url, json=payload)
            response.raise_for_status()
            caption = response.json()["candidates"][0]["content"]["parts"][0]["text"]
        except requests.exceptions.RequestException as e:
            caption = f"Error: Unable to generate caption. {str(e)}"
        print(caption)
        return (caption,)
This is the function that does the actual work: it takes an image as input and sends it to the Gemini API using the API key. The code is straightforward. The image tensor is converted to a PNG and base64-encoded so that it can be sent in the JSON request body, and the prompt instructs Gemini to caption the image in detail. The response from the API is parsed, printed to the console, and returned as a tuple (required by ComfyUI).
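Note that the try/except in caption_image only catches errors raised by requests, so a response without the expected fields (for example, if the API returns no candidates) would raise a KeyError or IndexError instead. Here is a minimal sketch of a more defensive approach, assuming the same request and response shapes used in the code above. build_payload and extract_caption are hypothetical helper names, not part of the original node.

```python
def build_payload(img_b64, prompt):
    """Build the request body in the same shape as the node's payload."""
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {"inline_data": {"mime_type": "image/png", "data": img_b64}},
                ]
            }
        ]
    }


def extract_caption(response_json):
    """Return the caption text, or an error string if the shape is unexpected."""
    try:
        return response_json["candidates"][0]["content"]["parts"][0]["text"]
    except (KeyError, IndexError, TypeError):
        return "Error: Unexpected response format."


# Example data made up for this sketch, not real API output.
ok = {"candidates": [{"content": {"parts": [{"text": "A red bicycle."}]}}]}
caption = extract_caption(ok)
fallback = extract_caption({"candidates": []})
payload = build_payload("aGVsbG8=", "Describe this image.")
```

Keeping these two steps in small helpers also makes them easy to test without calling the API.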
NODE_CLASS_MAPPINGS = {"ImageCaptioningNode": ImageCaptioningNode}
This dictionary maps the class name to the class itself, which is used by ComfyUI to register the custom node.
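ComfyUI also reads an optional NODE_DISPLAY_NAME_MAPPINGS dictionary from the same file, which sets the label shown for the node in the UI. Here is a sketch, with a placeholder class standing in for the real node and a display name chosen purely as an example:

```python
class ImageCaptioningNode:
    """Placeholder standing in for the node class defined earlier."""


# Internal identifier -> class.
NODE_CLASS_MAPPINGS = {"ImageCaptioningNode": ImageCaptioningNode}

# Internal identifier -> label shown in the UI (example label, an assumption).
NODE_DISPLAY_NAME_MAPPINGS = {"ImageCaptioningNode": "Image Captioning (Gemini)"}
```

The keys in both dictionaries should match, since the display name is looked up by the same identifier.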
Conclusion:
Creating custom nodes for ComfyUI opens up a world of possibilities for extending and enhancing your image generation workflows. In this article, we've walked through the process of building a custom image captioning node, demonstrating how to:
- Define input and output types
- Integrate with external APIs (in this case, the Gemini API for image captioning)
By following these steps, you can create your own custom nodes to add almost any functionality you need to ComfyUI. Whether you're integrating new LLMs, adding specialized image processing techniques, or creating shortcuts for common tasks, custom nodes allow you to tailor ComfyUI to your specific requirements.
Remember that while we've focused on image captioning in this example, the same principles can be applied to create nodes for a wide variety of tasks. The key is to understand the structure of a ComfyUI node and how to interface with the expected inputs and outputs.
If you still have any questions about this post or want to discuss something with me, feel free to connect on LinkedIn or Twitter.
If you run an organization and want me to write for you, please connect with me on my Socials 🙃