Jimmy Guerrero for Voxel51

Posted on May 10 • Updated on May 22 • Originally published at voxel51.com

Voxel51 Filtered Views Newsletter - May 10, 2024

#computervision #machinelearning #ai #datascience

Author: Harpreet Sahota (Hacker in Residence at Voxel51)

Welcome to Voxel51’s bi-weekly digest of the latest trending AI, machine learning and computer vision news, events and resources! Subscribe to the email version.

📰 The Industry Pulse

🎥 Will AI Kill the Video Star?

The release of the music video for The Hardest Part by Washed Out marks the first time that OpenAI’s Sora was commissioned for an AI-generated music video.

And its release has sparked a lively debate online. While many Redditors are captivated by its dreamlike visuals, describing it as "like dreaming while being conscious," others criticize technical shortcomings and a lack of narrative cohesion. Some comments highlight the impressive advancement in AI, with one user stating, "This technology is leaps and bounds better than anything that's ever existed in the last 10 years." However, others express disappointment, feeling the video doesn't live up to the initial hype. Despite the mixed reactions, there's a general sense of awe and curiosity surrounding Sora's potential, with discussions touching on AI consciousness and the future of creative expression.

As one user aptly says, "AI is dreaming, and we're peeking into its dreams."

👻 gpt2-chatbot: The AI That Appeared and Disappeared, Leaving Us in Awe

A mysterious new AI chatbot named "gpt2-chatbot" recently appeared on the LMSYS Chatbot Arena, a website that compares various AI systems. The chatbot's origin is unknown, but its impressive capabilities have sparked intense speculation among AI experts.

Some researchers believe that gpt2-chatbot represents a significant leap over existing AI models, possibly matching or even exceeding the abilities of GPT-4, currently the most advanced system from OpenAI. In tests, the gpt2-chatbot solved complex math problems and performed better than GPT-4 on specific reasoning tasks.

The chatbot's name has led to speculation that it could be an early prototype of GPT-5 or an updated version of GPT-4 (dubbed GPT-4.5). OpenAI CEO Sam Altman and staff member Steven Heidel have posted cryptic tweets hinting that gpt2-chatbot might be an unreleased model rather than a version of the older GPT-2. Despite the excitement, LMSYS quickly took gpt2-chatbot offline, citing "unexpectedly high traffic" and a "capacity limit." The organization stated that it has previously worked with developers to offer access to unreleased models for testing.

This mysterious chatbot's origin and full potential remain the subject of much speculation and anticipation.

💻 MIT Professor Creates Specialized Programming Languages for Efficient Visual AI

MIT's Jonathan Ragan-Kelley is tackling the challenge of utilizing increasingly complex hardware for visual AI applications like graphics and image processing. He achieves this by developing specialized programming languages that prioritize efficiency.

Two key approaches:

Domain-Specific Languages: Ragan-Kelley designs languages like Halide, tailored for specific tasks like image processing, to maximize efficiency.

Compiler Optimization: He focuses on automating how programs are mapped to hardware using compilers, balancing control and productivity.

His team is creating "user-schedulable languages" that offer high-level control over compiler optimization and exploring machine learning for generating optimized schedules. They also employ "exocompilation" to adapt compilers for specialized AI hardware.
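To make the "user-schedulable" idea a bit more concrete, here is a toy sketch in plain Python. It is not Halide and does not use Halide's API; the function names are made up for illustration. The point is only that the algorithm (what each output value is) stays fixed while the schedule (the order in which outputs are computed) can be swapped without changing the result.

```python
# Toy illustration only: plain Python, not Halide or its API.
# All names here are hypothetical.

def blur_at(src, x):
    """The algorithm: a 3-point box blur at position x (edges clamped)."""
    n = len(src)
    left = src[max(x - 1, 0)]
    right = src[min(x + 1, n - 1)]
    return (left + src[x] + right) / 3.0

def schedule_serial(n):
    """One schedule: visit outputs left to right."""
    return range(n)

def schedule_tiled(n, tile=4):
    """Another schedule: visit outputs tile by tile, last tile first."""
    order = []
    for start in reversed(range(0, n, tile)):
        order.extend(range(start, min(start + tile, n)))
    return order

def run(src, schedule):
    """Evaluate the same algorithm under a given iteration order."""
    out = [0.0] * len(src)
    for x in schedule:
        out[x] = blur_at(src, x)
    return out

signal = [float(v) for v in range(10)]
serial = run(signal, schedule_serial(len(signal)))
tiled = run(signal, schedule_tiled(len(signal)))
```

Both runs produce the same values; only the traversal order differs. A real schedulable language exposes far richer choices (tiling, vectorization, parallelism), which this sketch does not attempt to model.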

Furthermore, Ragan-Kelley's group is rethinking large language models (LLMs) for improved efficiency on AI hardware without sacrificing accuracy. He believes these advancements will unlock the full potential of new machines and accelerate the development of cutting-edge applications.

👨🏽‍💻 GitHub Gems

StoryDiffusion can generate consistent, long-range image sequences and videos.

It uses consistent self-attention to maintain character styles and attires across multiple frames, enabling cohesive visual storytelling.

Some key capabilities of StoryDiffusion include:

Generating comics in various styles while keeping characters consistent
Producing high-quality videos conditioned on either its own generated images or user-provided images
Creating impressive cartoon-style characters that remain consistent across frames
Handling multiple characters simultaneously and preserving their identities throughout an image sequence

The system relies on two main components:

A Consistent Self-Attention module that enforces consistency in generated images
A Motion Predictor that generates videos based on a sequence of input images
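For intuition, here is a rough NumPy sketch of how a consistency-enforcing self-attention step could work. It assumes the mechanism is "each image attends over its own tokens plus tokens sampled from the other images in the batch"; that is my reading of the idea, not StoryDiffusion's actual code, and the function names, shapes, and sampling scheme are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def consistent_self_attention(feats, wq, wk, wv, n_sampled, rng):
    """feats: (batch, tokens, dim) features for the images of one story.

    Each image's queries attend over its own tokens plus a random
    sample of tokens taken from the other images in the batch.
    """
    batch, tokens, dim = feats.shape
    outputs = []
    for i in range(batch):
        # Pool tokens from every other image, then sample a subset.
        others = np.concatenate(
            [feats[j] for j in range(batch) if j != i], axis=0
        )
        idx = rng.choice(others.shape[0], size=n_sampled, replace=False)
        paired = np.concatenate([others[idx], feats[i]], axis=0)

        q = feats[i] @ wq      # queries come from the image itself
        k = paired @ wk        # keys include the sampled tokens
        v = paired @ wv        # values include the sampled tokens
        attn = softmax(q @ k.T / np.sqrt(dim))
        outputs.append(attn @ v)
    return np.stack(outputs)

rng = np.random.default_rng(0)
dim = 8
feats = rng.normal(size=(3, 5, dim))   # 3 images, 5 tokens each
wq, wk, wv = (rng.normal(size=(dim, dim)) for _ in range(3))
out = consistent_self_attention(feats, wq, wk, wv, n_sampled=4, rng=rng)
```

The output keeps the per-image shape, so a layer like this could in principle stand in for ordinary self-attention while letting information about a character leak across frames.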

I highly recommend checking it out on GitHub.

📙 Good Reads

In this week's Good Reads, we follow the white rabbit down the hole with Simone Scardapane's "Alice's Adventures in a Differentiable Wonderland: A Primer on Designing Neural Networks."

This book, freely available on arXiv, goes deep into the core concepts and components of neural networks, offering a balanced blend of theory and practical application.

Scardapane guides readers through the fundamental principles of designing and implementing these powerful models, drawing parallels to Alice's journey through Wonderland. He emphasizes the increasing importance of scaling laws in the field, highlighting how larger models and datasets lead to significant improvements in accuracy. The book also acknowledges the rise of foundation models and the shift towards prompting pre-trained models for various tasks.

However, Scardapane argues that understanding the inner workings of these models remains crucial for customization, fine-tuning, and innovation. He provides readers with the tools to explore "under the hood" and grasp the intricacies of neural network design.

Here's a glimpse of what you'll discover:

Mathematical foundations: A refresher on linear algebra, gradients, Jacobians, and optimization techniques essential for understanding neural networks.
Datasets and losses: This section explores the different types of datasets and loss functions used in training models, including concepts like overfitting and maximum likelihood estimation.
Building blocks of neural networks: This section covers linear models, fully connected networks, convolutional layers, and recurrent models, along with practical considerations for implementation.
Advanced architectures: Unveiling the power of transformers and graph neural networks and their applications in handling complex data structures.
Scaling and optimization: Techniques for efficiently training large-scale models, including regularization methods, normalization, and residual connections.

🎙️ Good Listens: From Classroom to Corner Office: The Voxel51 Founding Story 🎧

This week’s Good Listen pulls back the curtain on Voxel51's origin story with a special episode featuring our co-founders, Brian Moore and Prof. Jason Corso, on the How I Met My Co-founder podcast!

Their journey began in the unlikely setting of a University of Michigan computer vision course in 2014. From professor and student to co-founders, they've navigated the exciting and often challenging world of AI startups, building Voxel51 into the company it is today.

Tune in to hear them discuss:

Bridging the gap: Transitioning from academia to the fast-paced world of entrepreneurship.
Building on a strong foundation: How their shared passion for technology and aligned vision formed the bedrock of Voxel51.
Evolving roles: Navigating leadership changes and adapting to the company's needs, including switching CEO roles five years in.
Taking the leap: Making a major business model pivot and the decision-making process behind it.
Navigating disagreements: Overcoming challenges like differing opinions on fundraising timelines and finding common ground.

Get an inside look at:

The early days of Voxel51 and their initial focus on applied research.
Their "leadership by committee" approach and the importance of equitable decision-making.
How they tackled one of their biggest disagreements: deciding when to fundraise.
The pivotal moment they chose to pivot the company's direction.
The power of trust, commitment, and open communication in a founding team.

Listen to Brian and Jason share their experiences, learnings, and the story behind Voxel51's success!

👨🏽‍🔬 Good Research

This week, we’re dissecting MM-LLMs: Recent Advances in MultiModal Large Language Models.

It’s a survey paper, so I’ll refrain from giving it the PACES treatment. The paper covers a lot of ground, but I want to zero in on one part—the model architecture of MM-LLMs. According to the paper, the general model architecture of MM-LLMs consists of five components: the Modality Encoder, Input Projector, LLM Backbone, Output Projector, and Modality Generator.

These components work together to process and generate content across multiple modalities.

Modality Encoder

In MM-LLMs, the Modality Encoder is typically implemented using pre-trained neural networks specifically designed and trained for processing data from a particular modality.

These pre-trained models have learned to recognize and extract relevant features from the input data through extensive training on large-scale datasets. For example, popular pre-trained models such as ViT (Vision Transformer), CLIP ViT, or Swin Transformer are used as the Modality Encoder when dealing with visual modalities like images or videos. These models have been trained on massive image datasets and have learned to capture visual features such as edges, textures, objects, and scenes.

Similarly, pre-trained models like C-Former, HuBERT, or Whisper are used as the Modality Encoder for the audio modality. These models have been trained on large audio datasets and have learned to extract relevant features such as phonemes, prosody, and speaker characteristics.

The Modality Encoder takes the raw input data. This input is processed through a pre-trained model to generate a set of encoded features. The encoded features are then passed through the Input Projector, which aligns them with the text feature space and generates prompts that can be fed into the LLM Backbone. This alignment process allows the LLM Backbone to understand and reason about the multimodal input in the context of the textual input.

During MM-LLM training, the Modality Encoder is usually kept frozen, which means its parameters are not updated. This approach has several advantages: it leverages the pre-trained knowledge of the Modality Encoder, reduces computational costs, and promotes flexibility in the MM-LLM architecture. Different Modality Encoders can be easily swapped or combined based on task requirements.

In short, the Modality Encoder extracts meaningful features from raw input data in various modalities. Those features are then aligned with the text feature space by the Input Projector and processed by the LLM Backbone, which enables the MM-LLM to understand and reason about multimodal input in the context of textual input.
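Here is a minimal NumPy sketch of the encoder's role for an image input, just to show the shapes involved. The random matrix is a stand-in for pre-trained weights, and the patch size and feature width are made-up values; a real encoder such as a ViT has many layers on top of this patch-embedding step.

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH = 4     # hypothetical patch size
D_ENC = 16    # hypothetical encoder feature width

# Stand-in for pre-trained weights. Marking the array read-only mimics
# a frozen encoder: nothing downstream is allowed to update it.
W_ENC = rng.normal(size=(PATCH * PATCH * 3, D_ENC))
W_ENC.setflags(write=False)

def encode_image(image):
    """Split an (H, W, 3) image into patches and embed each patch."""
    h, w, c = image.shape
    patches = (
        image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(-1, PATCH * PATCH * c)
    )
    return patches @ W_ENC    # (num_patches, D_ENC)

image = rng.random((8, 12, 3))     # toy "raw input"
features = encode_image(image)     # encoded features, one row per patch
```

An 8 x 12 image with 4 x 4 patches yields 6 feature vectors, each of width `D_ENC`. These are the encoded features that get handed to the Input Projector.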

Input Projector

The Input Projector aligns the encoded features of other modalities with the text feature space.

The aligned features, called prompts, are fed into the LLM Backbone alongside the textual features. The Input Projector can be implemented using various methods, such as:

Linear Projector: A simple linear transformation of the encoded features.
Multi-Layer Perceptron (MLP): Several linear projectors interleaved with non-linear activation functions.
Cross-attention (Perceiver Resampler): This method uses a set of trainable vectors as queries and the encoded features as keys to compress the feature sequence to a fixed length. The compressed representation is fed directly into the LLM or used for X-Text cross-attention fusion.
Q-Former: Extracts relevant features with learnable queries, and the selected features are then used as prompts.
P-Former: Generates "reference prompts" and imposes an alignment constraint on the prompts produced by Q-Former.
MQ-Former: Conducts a fine-grained alignment of multi-scale visual and textual signals.

The Input Projector enables MM-LLMs to process and understand the relationships between different modalities by aligning their features with the text feature space.
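The first three options in the list above can be sketched in a few lines of NumPy. This is a simplified illustration with random weights and hypothetical dimensions, not any particular model's implementation; in particular, the resampler here is a single cross-attention step, whereas real resamplers stack several layers.

```python
import numpy as np

rng = np.random.default_rng(0)
D_ENC, D_LLM = 16, 32    # hypothetical encoder and LLM feature widths

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linear_projector(feats, w, b):
    """A single linear map from encoder space to text feature space."""
    return feats @ w + b

def mlp_projector(feats, w1, b1, w2, b2):
    """Two linear maps with a ReLU non-linearity in between."""
    hidden = np.maximum(feats @ w1 + b1, 0.0)
    return hidden @ w2 + b2

def resampler(feats, queries, wk, wv):
    """Cross-attention: trainable queries compress any number of
    encoded features down to len(queries) output rows."""
    k = feats @ wk
    v = feats @ wv
    attn = softmax(queries @ k.T / np.sqrt(queries.shape[-1]))
    return attn @ v

feats = rng.normal(size=(50, D_ENC))    # 50 encoded features

prompts_linear = linear_projector(
    feats, rng.normal(size=(D_ENC, D_LLM)), np.zeros(D_LLM)
)
prompts_mlp = mlp_projector(
    feats,
    rng.normal(size=(D_ENC, 64)), np.zeros(64),
    rng.normal(size=(64, D_LLM)), np.zeros(D_LLM),
)
queries = rng.normal(size=(8, D_LLM))   # 8 trainable query vectors
prompts_resampled = resampler(
    feats, queries,
    rng.normal(size=(D_ENC, D_LLM)), rng.normal(size=(D_ENC, D_LLM)),
)
```

Note the practical difference: the linear and MLP projectors keep one prompt per input feature (50 here), while the resampler always returns a fixed number of prompts (8 here) regardless of input length.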

LLM Backbone

The LLM Backbone, which is the core agent of MM-LLMs, processes representations from various modalities and performs semantic understanding, reasoning, and decision-making.

It produces direct textual outputs and signal tokens from other modalities (if any) to guide the generator. The LLM Backbone is typically a pre-trained large language model, such as GPT, PaLM, Chinchilla, or LLaMA. These models have been trained on vast amounts of textual data and have acquired a deep understanding of language and its nuances. By leveraging these pre-trained LLMs, MM-LLMs can inherit their strong language understanding and generation capabilities.

The LLM Backbone takes two types of inputs:

Textual features (FT): The original text input is tokenized and transformed into a sequence of token embeddings.
Prompts from other modalities (PX): The encoded features of other modalities (e.g., images, videos, or audio) are aligned with the text feature space using the Input Projector and then fed into the LLM Backbone as prompts.
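As a small sketch of how these two inputs can meet, the snippet below embeds a toy text input (FT) and concatenates it with stand-in prompts (PX) into one sequence. The vocabulary, the dimensions, and the choice to place the prompts before the text are all assumptions for illustration; actual models differ in how they order or interleave the two.

```python
import numpy as np

rng = np.random.default_rng(0)
D_LLM = 32                                       # hypothetical LLM width
VOCAB = ["<pad>", "describe", "this", "image"]   # toy vocabulary

embedding_table = rng.normal(size=(len(VOCAB), D_LLM))

def embed_text(words):
    """Tokenize by vocabulary lookup, then fetch token embeddings."""
    ids = [VOCAB.index(w) for w in words]
    return embedding_table[ids]

text_features = embed_text(["describe", "this", "image"])   # F_T
prompts = rng.normal(size=(8, D_LLM))    # P_X stand-in from a projector

# One sequence for the backbone: modality prompts, then text tokens.
llm_input = np.concatenate([prompts, text_features], axis=0)
```

Because the prompts already live in the same feature space as the token embeddings, the backbone can treat all 11 rows as one ordinary input sequence.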

The LLM Backbone processes the input text and the aligned features from other modalities (prompts) to generate two types of outputs:

Direct textual outputs (t)
Signal tokens (SX) from other modalities (if any)

The signal tokens (SX) are instructions to guide the Modality Generator on whether to produce multimodal content and, if so, what content to generate. However, these signal tokens are represented in the latent space of the LLM Backbone, which may not be directly compatible with the input format required by the Modality Generator.

This is where the Output Projector comes into play.

Output Projector

The Output Projector maps the signal token representations from the LLM Backbone into features the Modality Generator understands.

This transformation is necessary because the Modality Generator, such as a Latent Diffusion Model (LDM) for generating images, expects input features in a specific format and dimensionality. The Output Projector is typically implemented using one of the following methods:

Tiny Transformer: This is a small-scale Transformer model with a learnable decoder feature sequence. It takes the signal tokens SX as input and generates a sequence of features that can be processed by the Modality Generator.
Multi-Layer Perceptron (MLP): A series of linear layers with non-linear activation functions that transform the signal token representations into the desired feature space for the Modality Generator.

The main goal of the Output Projector is to bridge the gap between the LLM Backbone and the Modality Generator by transforming the signal tokens into a suitable format.

This allows the MM-LLM to generate outputs in various modalities, such as images, videos, or audio, based on the information processed by the LLM Backbone.
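A minimal sketch of the MLP variant is below. The dimensions, the tanh activation, and the random weights are assumptions chosen only to show the change of shape from the LLM's latent space to whatever conditioning width a generator expects.

```python
import numpy as np

rng = np.random.default_rng(0)
D_LLM, D_HID, D_GEN = 32, 64, 24   # hypothetical widths
N_SIGNAL = 4                       # hypothetical number of signal tokens

# Trainable weights of the projector (random here).
w1 = rng.normal(size=(D_LLM, D_HID)) * 0.1
b1 = np.zeros(D_HID)
w2 = rng.normal(size=(D_HID, D_GEN)) * 0.1
b2 = np.zeros(D_GEN)

def output_projector(signal_states):
    """Map signal-token states to the generator's conditioning space."""
    hidden = np.tanh(signal_states @ w1 + b1)
    return hidden @ w2 + b2

signal_states = rng.normal(size=(N_SIGNAL, D_LLM))   # S_X stand-in
conditioning = output_projector(signal_states)       # generator input
```

Each signal token becomes one conditioning vector of width `D_GEN`, which is the kind of input a frozen generator could consume in place of its usual text-encoder output.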

Modality Generator

The Modality Generator takes the mapped features from the Output Projector as input and produces outputs in different modalities, such as images, videos, or audio.

Its role is to generate high-quality content in the target modality, guided by the instructions provided by the LLM Backbone in the form of signal tokens. In most MM-LLMs, the Modality Generator uses off-the-shelf pre-trained models, particularly Latent Diffusion Models (LDMs). LDMs are generative models that learn to map a latent space to the target modality space. They have shown impressive results in generating high-quality and diverse content in various modalities.

Some popular LDMs used as Modality Generators in MM-LLMs include:

Stable Diffusion: A powerful image generation model that can create realistic and diverse images based on textual descriptions.
ZeroScope: A video generation model that can produce high-quality video clips based on textual prompts.
AudioLDM: An audio generation model that can synthesize realistic audio samples, such as speech or music, based on textual descriptions.

The generation process typically involves iterative refinement steps, where the model gradually updates the generated content based on the input features and its learned priors. During MM-LLM training, the Modality Generator is usually kept frozen, similar to the LLM Backbone. This allows the MM-LLM to leverage the Modality Generator's pre-trained knowledge and reduces the training's computational cost. The Output Projector learns to map the signal tokens from the LLM Backbone to the input space of the Modality Generator, enabling the generation of high-quality multimodal content.

By combining these components, MM-LLMs can process and generate content across multiple modalities, leveraging the strengths of pre-trained models while maintaining the flexibility to adapt to different tasks through the training of the Input and Output Projectors.
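To illustrate the "frozen components, trainable projectors" recipe, here is a toy NumPy training loop. Everything in it is a stand-in: the encoder is a fixed random matrix, the target is random data playing the role of the text feature space, and the loss is plain mean squared error. Real MM-LLM training objectives are more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_ENC, D_LLM = 12, 16, 8    # hypothetical widths

w_enc = rng.normal(size=(D_IN, D_ENC))            # frozen encoder stand-in
w_proj = rng.normal(size=(D_ENC, D_LLM)) * 0.1    # trainable projector

x = rng.normal(size=(64, D_IN))         # toy raw inputs
target = rng.normal(size=(64, D_LLM))   # toy targets in "text space"

def loss_and_grad(w_proj):
    """MSE loss and its gradient with respect to the projector only."""
    feats = x @ w_enc                   # the encoder is used, never updated
    err = feats @ w_proj - target
    loss = np.mean(err ** 2)
    grad = 2.0 * feats.T @ err / err.size
    return loss, grad

w_enc_before = w_enc.copy()
initial_loss, _ = loss_and_grad(w_proj)
for _ in range(200):
    _, grad = loss_and_grad(w_proj)
    w_proj -= 0.01 * grad               # gradient step on the projector
final_loss, _ = loss_and_grad(w_proj)
```

Only `w_proj` changes during the loop, while `w_enc` ends exactly as it started. That is the appeal of the design: the expensive pre-trained parts stay fixed, and only a small set of projector weights needs to be learned.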

📣 Voxel51 Announcement

We're excited to announce that the open source FiftyOne computer vision toolkit has crossed 2 million downloads! Getting started is as easy as “pip install fiftyone”.

Learn more on GitHub: https://github.com/voxel51/fiftyone

🗓️ Upcoming Events

Check out these upcoming AI, machine learning and computer vision events! View the full calendar and register for an event.
