Mr.Shah
Llama 3.2 Vision (11B Vision-Instruct Model) in Kaggle: A Step-by-Step Guide

In this guide, I'll walk you through how to run the Llama 3.2 11B Vision-Instruct model on Kaggle, a popular platform for data science and machine learning projects.

Step 1: Getting the Green Light

Before we dive into the code, there's a bit of paperwork to handle. Meta (you know, the folks behind Facebook) created Llama, and they want to make sure it's used responsibly. So, your first task is to get their approval:

  • Head over to the official Meta website.
  • Look for the Llama model license application.
  • Fill out the form and explain why you want to use Llama.
  • Cross your fingers and wait for approval!


Step 2: Using the Llama 3.2 11B Vision-Instruct Model in Kaggle

Before touching any code, make sure you have:

  1. Approval from Meta for the Llama 3.2 11B Vision-Instruct model
  2. A Kaggle account (use the same email as in your Meta approval)
  3. A new notebook with GPU acceleration enabled (if available)
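
One more setup detail: to get the model weights into your notebook, attach them through the notebook's "Add Input" panel (search for Llama 3.2 Vision under Models). Kaggle mounts attached models under /kaggle/input/, which is exactly where the path in Step 5 points.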

Step 3: Installation

!pip install transformers==4.45.1 accelerate  # accelerate lets us use device_map="auto" in Step 5
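
A quick, optional sanity check that the install took and that a GPU is actually visible:

import transformers, torch
print(transformers.__version__)    # should print 4.45.1
print(torch.cuda.is_available())   # True if GPU acceleration is enabled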

Step 4: Import Necessary Modules

from transformers import AutoProcessor, MllamaForConditionalGeneration
import torch
from PIL import Image
import requests

Note: Llama 3.2 Vision is a multimodal model, so we import MllamaForConditionalGeneration instead of AutoModelForCausalLM - the latter can't load this architecture.

Step 5: Load the Model

model_id = "/kaggle/input/llama-3.2-vision/transformers/11b-vision-instruct/1"
processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the 11B model fits in GPU memory
    device_map="auto",          # let accelerate spread the layers across available GPUs
)
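
Since we loaded with device_map="auto", you can peek at how the layers were distributed - handy when Kaggle hands you two GPUs:

print(model.hf_device_map)  # maps each module to the device it landed on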

Step 6: Prepare an Image

url = "https://example.com/your-image.jpg"  # replace with a real image URL
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")  # force 3-channel RGB
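
If your image lives in a Kaggle dataset rather than on the web, Image.open works on local paths too (the path below is just a placeholder):

image = Image.open("/kaggle/input/my-dataset/photo.jpg").convert("RGB")  # placeholder path - point it at your own file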

Step 7: Create Input Messages

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."}
    ]}
]
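If you're curious what this turns into, you can print the rendered prompt before generating anything - it should contain a special <|image|> placeholder where the image is injected:

preview = processor.apply_chat_template(messages, add_generation_prompt=True)
print(preview)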

Step 8: Process Input and Generate Output

input_text = processor.apply_chat_template(messages, add_generation_prompt=True)  # render the chat into a prompt string
inputs = processor(image, input_text, return_tensors="pt").to(model.device)       # preprocess the image and tokenize the text together
output = model.generate(**inputs, max_new_tokens=100)                             # generate up to 100 new tokens

Step 9: Display Results

generated_text = processor.decode(output[0], skip_special_tokens=True)  # note: this includes the prompt text too
print(generated_text)

Advanced Version - Tuning Your Input and Output

Let's look at a chunk of code that might seem a bit intimidating at first, but I promise it's not as scary as it looks:

# Process the input
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

# Calculate the number of tokens in the input
input_token_count = inputs["input_ids"].shape[-1]

# Cap the response length: allow up to ~200 words at roughly 3 tokens per word
max_new_tokens = 200 * 3

# Generate the output
output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.7)

# Decode and print the generated text
generated_text = processor.decode(output[0][input_token_count:], skip_special_tokens=True)
print(generated_text)

Try this block in place of Steps 8 and 9 (i.e., right after Step 7) and compare which output you prefer.

What's Going On Here?

Let's break this down in simple terms:

  • We're importing some tools to help us work with Llama and handle images.
  • We tell the computer which version of Llama we want to use.
  • We grab an image from the internet for Llama to look at.
  • We ask Llama a question about the image.
  • We prepare the question and image in a way Llama can understand.
  • We let Llama think about it and come up with an answer.
  • Finally, we translate Llama's answer into human-readable text and print it out.

Why This Matters

By tweaking these settings, you can control how Llama responds to your prompts:

  • Want a more creative, out-of-the-box description? Try increasing the temperature.
  • Need a more focused, detailed analysis? Lower the temperature and increase the max_new_tokens.
  • Working with a complex image? You might want to increase max_new_tokens to give Llama more room to describe everything it sees.
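
As a rough illustration (the exact numbers are just starting points, not magic values), here is what those two extremes might look like:

# Focused and deterministic: greedy decoding with room for a long description
output = model.generate(**inputs, max_new_tokens=300, do_sample=False)

# Creative and varied: sampling at a higher temperature
output = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=1.0, top_p=0.9)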

And there you have it! You've just taught a computer to see and describe an image. Llama will look at the picture and tell you what it sees, just like a person would.

Remember, Llama is pretty smart, but it's not perfect. Sometimes it might see things that aren't there, or miss things that are. That's why it's important to use AI responsibly and always double-check its work.

Happy coding, and may your AI adventures be filled with exciting discoveries!

Bye
