Get started with multimodal conversational models using the open-source LLaVA-1.5 model, Hugging Face Transformers and Runhouse.
The full Python code for this tutorial, including standing up the necessary infrastructure, is publicly available in this GitHub repo for you to try yourself.
For the first version of this tutorial check out this post.
Introduction
Multimodal conversational models represent a leap forward from text-only AI by harnessing the strengths of Large Language Models (LLMs) and Reinforcement Learning from Human Feedback (RLHF) to address challenges that require a combination of language and additional modalities (e.g. image and text). Visual capabilities in GPT-4V(ision) have been a recent highlight, producing a sophisticated model that handles language and image comprehension in the same context. Though GPT-4V is undeniably advanced, the proprietary nature of such closed-source models often restricts the scope for research and innovation. In radiology AI, for example, new technologies often require FDA approval, and relying on a model whose results vary from month to month makes reproducibility during audits difficult.
Fortunately, the landscape is evolving with the introduction of open-source alternatives that democratize access to vision-language models. Deploying these models, however, is not trivial, especially on self-hosted hardware.
Runhouse is an open-source platform that makes it easy to deploy and run your machine learning application on any cloud or self-hosted infrastructure. In this tutorial, we will guide you step by step through creating your own vision chat assistant built on the LLaVA-1.5 (Large Language and Vision Assistant) multimodal model, as described in the Visual Instruction Tuning paper. After a brief overview of the LLaVA-1.5 model, we'll delve into the implementation code for a vision chat assistant, utilizing resources from the official code repository. Runhouse will allow us to stand up the necessary infrastructure and deploy the visual chatbot application in just 4 lines of Python code (!!).
What is LLaVA-1.5?
The LLaVA model was introduced in the paper Visual Instruction Tuning, and then further improved in Improved Baselines with Visual Instruction Tuning (also referred to as LLaVA-1.5).
The core idea behind it is to extract visual embeddings from an image and treat them in the same way as embeddings coming from textual language tokens by feeding them to a Large Language Model (Vicuna). To choose the “right” embeddings, the model uses a pre-trained CLIP visual encoder to extract the visual embeddings and then projects them into the word embedding space of the language model. The projection is performed by a vision-language connector, which was a simple linear layer in the first paper and was later replaced with a more expressive Multilayer Perceptron (MLP) in Improved Baselines with Visual Instruction Tuning. The architecture of the model is depicted below.
One of the advantages of the method is that by using a pre-trained vision encoder and a pre-trained language model, only the lightweight vision-language connector must be learned from scratch.
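To make the connector idea concrete, here is a minimal NumPy sketch of a two-layer MLP that maps visual embeddings into the word embedding space and joins them with text embeddings. The sizes are toy values chosen for readability, the weights are random stand-ins for learned parameters, and the class name MLPConnector is hypothetical; this is an illustration of the idea, not the LLaVA implementation.

```python
import numpy as np

# Toy sizes for illustration only; real models use far larger dimensions.
NUM_PATCHES, VISION_DIM, TEXT_DIM, NUM_TEXT_TOKENS = 16, 32, 64, 12

rng = np.random.default_rng(0)


def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class MLPConnector:
    """Hypothetical two-layer MLP mapping visual embeddings to the text embedding width."""

    def __init__(self, vision_dim, text_dim):
        # Random weights stand in for the parameters that training would learn.
        self.w1 = rng.normal(0.0, 0.02, size=(vision_dim, text_dim))
        self.b1 = np.zeros(text_dim)
        self.w2 = rng.normal(0.0, 0.02, size=(text_dim, text_dim))
        self.b2 = np.zeros(text_dim)

    def __call__(self, visual_embeddings):
        hidden = gelu(visual_embeddings @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2


connector = MLPConnector(VISION_DIM, TEXT_DIM)
visual_embeddings = rng.normal(size=(NUM_PATCHES, VISION_DIM))   # stand-in for encoder output
text_embeddings = rng.normal(size=(NUM_TEXT_TOKENS, TEXT_DIM))   # stand-in for token embeddings

image_tokens = connector(visual_embeddings)
# The language model then receives image and text embeddings as one sequence.
llm_input = np.concatenate([image_tokens, text_embeddings], axis=0)
print(image_tokens.shape, llm_input.shape)
```

Only the connector's four arrays would need training in this picture; the encoder and the language model that produce and consume the embeddings are already pre-trained.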
One of the impressive resulting model versions, LLaVA-1.5 13B, achieved SoTA on 11 benchmarks with just simple modifications to the original LLaVA. It uses only public data, completes training in ~1 day on a single 8-A100 node, and surpasses other methods that use billion-scale data (source).
Because the LLaVA model is so lightweight to train and fine-tune, novel domain-specific agents can be created in no time. One such example is Microsoft's Large Language and Vision Assistant for bioMedicine (LLaVA-Med). More on it in a subsequent post.
LLaVA-1.5 Visual Chatbot Implementation
The full Python code is available in this GitHub repo so that you can try it yourself.
Note: this code builds upon the amazing Hugging Face Transformers 4.36.0 release, which adds support for Mixtral, LLaVA, BakLLaVA and more.
Creating a multimodal chatbot (e.g. one that answers follow-up textual questions about a provided image) using the code provided in the official repository is relatively straightforward. The repository provides standardized chat templates that can be used to parse the inputs in the right format. Following the format used in training and fine-tuning is crucial for the quality of the answers generated by the model. The exact template depends on the specific variant of the language model used. The template for LLaVA-1.5 with a pre-trained Vicuna language model looks like this:
A chat between a curious user and an artificial intelligence assistant. The
assistant gives helpful, detailed, and polite answers to the user's questions.
USER: <im_start><image><im_end> User's prompt
ASSISTANT: Assistant answer
USER: Another prompt
The first few lines are the general system prompt used by the model. The special tokens <im_start>, <image>, and <im_end> are used to indicate where embeddings representing the image will be placed. The chatbot can be defined in just one simple Python class. Here it is:
import requests
import runhouse as rh
import torch
from PIL import Image
from transformers import pipeline


class LlavaModel(rh.Module):
    def __init__(self, model_id="llava-hf/llava-1.5-7b-hf", **model_kwargs):
        super().__init__()
        self.model_id, self.model_kwargs = model_id, model_kwargs
        self.model = None

    def load_model(self):
        # Build the Hugging Face pipeline; model_kwargs carries options such as load_in_4bit.
        self.model = pipeline("image-to-text",
                              model=self.model_id,
                              device_map="auto",
                              torch_dtype=torch.bfloat16,
                              model_kwargs=self.model_kwargs)

    def predict(self, img_path, prompt, **inf_kwargs):
        # Load the model lazily on the first call.
        if not self.model:
            self.load_model()
        with torch.no_grad():
            # img_path is a URL: download the image and open it with PIL.
            image = Image.open(requests.get(img_path, stream=True).raw)
            return self.model(image, prompt=prompt, generate_kwargs=inf_kwargs)[0]["generated_text"]


if __name__ == "__main__":
    gpu = rh.cluster(name="rh-a10x", instance_type="A10G:1")
    remote_llava_model = LlavaModel(load_in_4bit=True).get_or_to(system=gpu,
                                                                 env=rh.env(["transformers==4.36.0"],
                                                                            working_dir="local:./"),
                                                                 name="llava-model")
    ans = remote_llava_model.predict(img_path="https://upcdn.io/kW15bGw/raw/uploads/2023/09/22/file-387X.png",
                                     prompt="USER: <image>\nHow would I make this dish? Step by step please."
                                            "\nASSISTANT:",
                                     max_new_tokens=200)
    print(ans)
Let’s walk through the methods of the class defined above.
load_model: loads the model using the pipeline() abstraction from the Hugging Face Transformers framework, with weights in reduced precision (torch.bfloat16) and any extra options passed through model_kwargs (here, load_in_4bit=True). Using 16-bit precision or quantizing to 8 or 4 bits reduces GPU memory requirements. In our case, the quantized model fits on one NVIDIA A10G GPU.
predict: takes an image URL (img_path) and a text prompt (prompt) and returns the textual response of the loaded LLaVA model.
main: the Runhouse-enabled setup that handles all aspects of setting up the remote compute cluster (using AWS EC2 with a single A10G).
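To see why lower precision matters, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It assumes roughly 7 billion parameters (suggested by the "7b" in the model id) and the 24 GB of memory on a single A10G, and it ignores activations and framework overhead, so treat the numbers as rough guidance rather than measurements.

```python
# Rough weight-memory estimate; activations and framework overhead are ignored.
NUM_PARAMS = 7e9         # assumption: about 7 billion parameters ("7b" in the model id)
A10G_MEMORY_GB = 24      # memory of a single NVIDIA A10G


def weight_memory_gb(num_params, bits_per_param):
    """Return the memory needed to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (32, 16, 8, 4):
    gb = weight_memory_gb(NUM_PARAMS, bits)
    verdict = "fits" if gb < A10G_MEMORY_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:5.1f} GB ({verdict} in {A10G_MEMORY_GB} GB)")
```

Under these assumptions, full 32-bit weights would not fit on the GPU at all, while 4-bit weights leave most of the memory free for activations and generation.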
Now let’s take a look at the Runhouse-specific code in detail.
import runhouse as rh
A prerequisite to the awesomeness in this article.
class LlavaModel(rh.Module):
The LlavaModel class inherits from the Runhouse Module class. Modules represent classes that can be sent to and used on remote clusters and environments. They support remote execution of methods.
gpu = rh.cluster(name="rh-a10x", instance_type="A10G:1")
This command defines a new on-demand cluster named rh-a10x utilizing 1 NVIDIA A10G GPU. A prerequisite to this command is setting up at least one cloud provider using the following documentation. For the purpose of this tutorial, we’ve used AWS.
remote_llava_model = LlavaModel(load_in_4bit=True).get_or_to(system=gpu,
                                                             env=rh.env(["transformers==4.36.0"], working_dir="local:./"),
                                                             name="llava-model")
This command passes load_in_4bit=True so that the model is quantized to fit into the available GPU memory, defines an environment with transformers installed, and names the deployed model llava-model.
The Runhouse get_or_to method is an alternative to the simpler to method. It deploys a LlavaModel instance to the GPU cluster defined above, but first looks for an existing instance with the specified name and reuses it if one is found, saving costs for the team.
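The reuse-by-name behavior can be pictured with a toy, in-memory sketch. This is not Runhouse code: the registry dictionary and the get_or_deploy helper are hypothetical names, used only to illustrate the get-or-create pattern described above.

```python
# Toy, in-memory illustration of get-or-create semantics; not Runhouse code.
_deployed = {}  # hypothetical registry mapping a name to a deployed object


def get_or_deploy(name, factory):
    """Return the object registered under `name`, deploying it only if missing."""
    if name not in _deployed:
        _deployed[name] = factory()  # the expensive step runs at most once per name
    return _deployed[name]


calls = []


def expensive_setup():
    # Stand-in for sending a module to a cluster and loading a model there.
    calls.append("deployed")
    return object()


first = get_or_deploy("llava-model", expensive_setup)
second = get_or_deploy("llava-model", expensive_setup)
print(first is second, len(calls))
```

The second lookup returns the object created by the first, which is the property that avoids repeating an expensive deployment.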
Now that we’ve created our multimodal chatbot and deployed it on an on-demand cluster, we’ll walk through running the inference and what an example output might look like!
Running Inference on our Visual Chatbot
Now that our model is deployed to an on-demand AWS cluster, we’re able to run the conversational UX using the predict() method. We’ll be asking questions about a matcha hot dog image, to test the model’s ability to understand an unnatural image.
ans = remote_llava_model.predict(img_path="https://upcdn.io/kW15bGw/raw/uploads/2023/09/22/file-387X.png",
                                 prompt="USER: <image>\nHow would I make this dish? Step by step please."
                                        "\nASSISTANT:",
                                 max_new_tokens=200)
print(ans)
The img_path corresponds to a publicly available link to the hot dog image and the prompt is an initial question to ask our model.
One could implement a continue_chat method to demonstrate that this is an actual chat interface by asking follow-up questions about the image. Please see the previous version of the post on how to do so.
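As a starting point for such follow-up questions, here is a minimal sketch of a helper that assembles a multi-turn prompt in the same USER/ASSISTANT style used above. The function name build_prompt and the choice of a newline between turns are assumptions for illustration; the separators that work best should be checked against the chat template of the model variant you use.

```python
IMAGE_TOKEN = "<image>"


def build_prompt(history, new_question):
    """Build a USER/ASSISTANT prompt from earlier turns plus a new question.

    history: list of (question, answer) pairs from earlier turns.
    The image placeholder appears only once, in the first user turn.
    """
    questions = [q for q, _ in history] + [new_question]
    answers = [a for _, a in history]
    parts = []
    for i, question in enumerate(questions):
        prefix = f"{IMAGE_TOKEN}\n" if i == 0 else ""
        parts.append(f"USER: {prefix}{question}")
        # Completed turns carry the model's answer; the last turn is left open.
        parts.append(f"ASSISTANT: {answers[i]}" if i < len(answers) else "ASSISTANT:")
    return "\n".join(parts)


first_prompt = build_prompt([], "How would I make this dish? Step by step please.")
follow_up = build_prompt(
    [("How would I make this dish? Step by step please.", "Grill the hot dog...")],
    "What could I serve with it?",
)
print(follow_up)
```

With an empty history the helper reproduces the single-turn prompt used in this post, and the follow-up string could be passed as prompt to predict() together with the same img_path.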
After running this the first time to set up the infrastructure, the output should look like this. (Notice how logs and stdout are streamed back to you as if the application was running locally. Thank you Runhouse!)
python -m llava_chat.llava_chat_transformers
INFO | 2023-11-28 18:09:42.394662 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-11-28 18:09:42.550811 | Authentication (publickey) successful!
INFO | 2023-11-28 18:09:42.551436 | Connecting to server via SSH, port forwarding via port 32300.
INFO | 2023-11-28 18:09:45.611387 | Checking server rh-a10x
INFO | 2023-11-28 18:09:45.652054 | Server rh-a10x is up.
INFO | 2023-11-28 18:09:45.652610 | Getting llava-model
INFO | 2023-11-28 18:09:46.346477 | Time to get llava-model: 0.69 seconds
INFO | 2023-11-28 18:09:46.347149 | Calling llava-model.predict
base_env servlet: Calling method predict on module llava-model
INFO | 2023-11-28 18:09:57.475645 | Time to call llava-model.predict: 11.13 seconds
... Answer ...
To make this dish, which appears to be a hot dog covered in green sauce and possibly topped with seaweed, follow these steps:
1. Prepare the hot dog: Grill or boil the hot dog until it is cooked through and heated to your desired temperature.
2. Prepare the green sauce: Combine ingredients like mayonnaise, mustard, ketchup, and green food coloring to create a green sauce. You can also add other ingredients like chopped onions, relish, or pickles to enhance the flavor.
3. Prepare the seaweed: Wash and chop the seaweed into small pieces. You can use a food processor or a knife to achieve the desired size.
4. Assemble the hot dog: Place the cooked hot dog in a bun and spread the green sauce evenly over the top. Add the chopped seaweed pieces on top of the sauce, ensuring they are evenly distributed.
5. Serve: Serve the hot dog with a side of your choice, such as chips or a salad, and enjoy your unique and delicious creation.
As we know by now, a matcha hot dog is a dish best served cold.
If you want to try it for yourself, this tutorial is hosted in this public GitHub repo.
Conclusion
Visual chat models are a major step forward from text-only AI, introducing vision capabilities to conversations. For certain applications, self-hosting is crucial for auditability, reproducibility, and control over the accuracy and performance of the application. This can be a hard requirement for certain medical and financial use cases. In addition, deploying with Runhouse can help reduce training and inference costs by automatically selecting cloud providers based on price and availability.
In a subsequent post we’ll explore use cases leveraging LLaVA-Med and other potential machine learning applications in the medical field, radiology AI in particular.