Aryan Kargwal

Multi-Modality and Image Gen in a 1.3B Model!🔮

Code: Click Me
YouTube: Click Me

Today, we’re diving into something exciting: Janus 1.3B, one of the smallest yet most capable truly multimodal LLMs. What sets Janus apart is that, despite its small size, it delivers strong results in both language understanding and image generation. This is a perfect example of where AI is heading: smaller models that stay versatile and multimodal.


Janus 1.3B

So, what exactly is Janus 1.3B? At its core, Janus is a vision-language model (VLM) designed to handle both textual and visual data. With just 1.3 billion parameters, Janus is significantly smaller than some of the other LLMs and multimodal models we’ve discussed on the channel. But don’t let that fool you; it can perform both image understanding and image generation, which makes it a remarkably capable tool for its compact size.

Unlike most models, which specialize in one area or need large architectures to function effectively across multiple domains, Janus achieves this multimodal functionality at a much smaller scale. This is a massive step toward making AI more efficient, accessible, and, most importantly, scalable.


How Does Janus Work?

Let’s start with its architecture. Janus processes text understanding, multimodal understanding, and visual generation through independent encoding methods that eventually feed into a unified autoregressive transformer. This design allows it to handle different types of input—text, images, or a combination of both—in a highly efficient manner.


Here’s the breakdown of how it all works:

  1. Text Understanding: Janus employs a built-in tokenizer from its underlying LLM. This tokenizer converts text into discrete IDs (tokens), which are transformed into feature representations. The LLM processes these features in the same way as any other text-based model.

  2. Multimodal Understanding: For image inputs, Janus integrates SigLIP, a powerful vision encoder that extracts high-dimensional semantic features from images. These features are flattened from a 2D grid into a 1D sequence and passed through an understanding adaptor. This adaptor maps the image features into the input space of the LLM, ensuring that image and text data are represented in a way the model can process together.

  3. Image Generation: Janus utilizes a Vector Quantization (VQ) tokenizer to generate images. This tokenizer converts images into a sequence of discrete IDs. These ID sequences are flattened and passed through a generation adaptor, which maps them into the LLM’s input space. This allows Janus to generate image content from a text description. A specialized image prediction head is trained for this task, while Janus relies on the LLM’s existing text prediction head for text-based tasks.

Once the inputs, whether text, image, or both, are converted into feature sequences, Janus concatenates them into a unified multimodal feature sequence. This sequence is then fed into the LLM for processing, making it capable of generating text and images based on the input it receives.
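
To make this concrete, here is a minimal, illustrative sketch of how such a unified feature sequence could be assembled. The hidden size, vision feature dimension, vocabulary size, and adaptor module below are placeholders chosen for illustration, not Janus’ actual configuration:

import torch
import torch.nn as nn

hidden_dim = 2048                                      # illustrative LLM hidden size
text_embedding = nn.Embedding(32000, hidden_dim)       # LLM token IDs -> embeddings
understanding_adaptor = nn.Linear(1152, hidden_dim)    # vision features -> LLM input space

def build_multimodal_sequence(text_token_ids, image_features):
    # Text path: discrete token IDs -> feature vectors, shape (num_text_tokens, hidden_dim)
    text_feats = text_embedding(text_token_ids)

    # Image path: a 2D grid of vision-encoder features, flattened to 1D and projected
    # image_features: (H, W, 1152) -> (H*W, 1152) -> (H*W, hidden_dim)
    image_feats = understanding_adaptor(image_features.flatten(0, 1))

    # Concatenate into one multimodal sequence for the autoregressive transformer
    return torch.cat([image_feats, text_feats], dim=0)

# Example: a 24x24 grid of image features followed by 16 text tokens
sequence = build_multimodal_sequence(torch.randint(0, 32000, (16,)),
                                     torch.randn(24, 24, 1152))
print(sequence.shape)  # torch.Size([592, 2048])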


Janus Multi-Modal Performance

Now, let’s talk performance. Despite its relatively small size of 1.3 billion parameters, Janus is competitive across several multimodal tasks. It excels in Visual Question Answering (VQA) benchmarks, COCO Captioning, and Image-Text Retrieval.

[Figure: Janus multimodal benchmark results]

Janus is designed to handle real-world multimodal applications where parameter efficiency is critical. While larger models might outperform Janus on tasks that require deep reasoning over complex text or high-resolution images, Janus hits a sweet spot by balancing efficiency and performance for general-purpose multimodal applications.


How to Use Janus for Multi-Modal Integration

Now, let us see how to use the model for multimodal inference. Below is a simplified example of a generate_answer function that takes an image and a question as inputs; the load_* and other helpers are placeholders for the actual model-loading and preprocessing code.

def generate_answer(image_path, question):
    # Load the VL model, tokenizer, and vision-language chat processor
    # (placeholder helpers standing in for the actual Janus checkpoint loading)
    model = load_vl_gpt_model()
    tokenizer = load_tokenizer()
    vl_chat_processor = load_vl_chat_processor()

    # Define the conversation: the user's question plus a reference to the image
    conversation = f"{question} [image: {image_path}]"

    # Load and preprocess the image (resizing/normalization as the processor expects)
    image = preprocess_image(image_path)

    # Pack the image and text into model-ready inputs
    inputs = vl_chat_processor.process(image, conversation)

    # Project the inputs into the LLM's embedding space
    input_embeddings = model.get_embeddings(inputs)

    # Autoregressively generate the answer tokens
    answer_tokens = model.generate(input_embeddings)

    # Decode the generated token IDs back into text
    return decode_answer(answer_tokens)

In this code, we load the necessary components, prepare the image and question for processing, and generate a response that combines visual context with the posed question.
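
Assuming the placeholder helpers are wired up to an actual Janus checkpoint, calling the function would look something like this (the image path and question are just examples):

answer = generate_answer("images/street_scene.jpg", "How many people are crossing the road?")
print(answer)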


Janus Image Generation

Finally, let’s examine Janus’ image generation capabilities. While it can’t match dedicated image models like DALL-E 2 or Stable Diffusion, Janus still creates high-quality images from textual prompts in an incredibly compact package.

[Figure: Janus image generation examples]

As mentioned, Janus uses the VQ tokenizer to represent images as sequences of discrete tokens. At generation time, the LLM predicts these image tokens autoregressively, one after another, conditioned on the text prompt, and the VQ decoder then converts the completed token sequence back into pixels. The result? Images that are coherent and contextually accurate, especially for more straightforward or abstract prompts.
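
To illustrate the VQ decoding step, here is a minimal sketch: each predicted token ID is looked up in a learned codebook, and the resulting latent grid is run through a convolutional decoder. The codebook size, grid size, and decoder layers below are illustrative placeholders, not Janus’ actual VQ tokenizer:

import torch
import torch.nn as nn

codebook_size, code_dim, grid = 16384, 8, 24             # illustrative VQ settings

codebook = nn.Embedding(codebook_size, code_dim)          # learned VQ codebook
decoder = nn.Sequential(                                  # stand-in for the VQ decoder
    nn.ConvTranspose2d(code_dim, 64, kernel_size=4, stride=2, padding=1),
    nn.ReLU(),
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),
)

def vq_decode(image_token_ids):
    # image_token_ids: (grid*grid,) discrete IDs predicted by the LLM
    latents = codebook(image_token_ids)                                  # (grid*grid, code_dim)
    latents = latents.view(1, grid, grid, code_dim).permute(0, 3, 1, 2)  # (1, code_dim, grid, grid)
    return decoder(latents)                                              # (1, 3, 96, 96) RGB tensor

image = vq_decode(torch.randint(0, codebook_size, (grid * grid,)))
print(image.shape)  # torch.Size([1, 3, 96, 96])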

How to Use Janus for Image Generation

The process starts with tokenizing the prompt using the vl_chat_processor, which converts the text into numerical representations the model can understand. The sketch below uses the same kind of placeholder objects as before (model, vl_chat_processor, and the decode/save helpers are assumed to be set up already).

def generate_image(prompt, num_tokens=576):
    # Tokenize the prompt into IDs the model can consume
    tokenized_prompt = vl_chat_processor.tokenize(prompt)

    # Create the initial embeddings from the prompt tokens
    embeddings = model.create_embeddings(tokenized_prompt)

    # Autoregressively generate the discrete image tokens one at a time
    # (num_tokens is the length of the image token sequence; 576 is illustrative)
    image_tokens = []
    for _ in range(num_tokens):
        token = model.generate_next_token(embeddings)
        image_tokens.append(token)
        # Feed the new token back in so the next prediction is conditioned on it
        embeddings = model.update_embeddings(embeddings, token)

    # Decode the token sequence back into pixels with the VQ decoder
    image = decode_image(image_tokens)

    # Save the result to disk
    save_image(image, "output_image.jpg")
    return image

This code illustrates how Janus generates an image from a text prompt: the image tokens are produced one at a time, each conditioned on the prompt and on the tokens generated so far, which keeps the result relevant to the original prompt.
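
With the helper above in place, generating an image is a single call (the prompt is just an example):

generate_image("a watercolor painting of a lighthouse at sunset")
# The result is also saved to output_image.jpg by save_image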


Conclusion

So there you have it—Janus 1.3B, a small but compelling multimodal model that punches well above its weight. Its ability to handle text understanding, multimodal reasoning, and image generation in such a compact framework is a testament to the efficiency of its design.

For those interested in multimodal AI that can be deployed in real-world applications without massive computational power, Janus is a model you should watch.
