We perceive the world in a multimodal manner, combining information from our various senses — such as sight, hearing, smell, touch, and taste — to form a comprehensive understanding of our surroundings. To develop AI models capable of making decisions as well as, or better than, humans, it is essential for these models to also consider multimodal data. Furthermore, AI models must be aware of the confidence levels in their decisions, as incorrect decisions can lead to catastrophic outcomes. In this tutorial, we present a simple guide on how to use the LUMA multimodal dataset to introduce varying levels of uncertainty in the data and estimate the model’s uncertainty.
Uncertainty Quantification
Machine Learning and Deep Learning now drive a wide range of products and applications that we use daily, from image editing software to self-driving cars. These applications often process diverse types of information, including audio, images, text, and sensor data. To build Deep Learning models that perform well, it is crucial to integrate all these types of information during training. We refer to these various forms of data as “data modalities,” and the deep learning models that utilize them are known as Multimodal Deep Learning models.
Similar to conventional deep learning models, Multimodal Deep Learning models also suffer from overconfidence. Overconfidence occurs when a model assigns excessively high probabilities to its predictions, even when they are incorrect. This can lead to catastrophic results. For example, a confidently wrong prediction in a self-driving car can lead to injury or death of the passengers, as happened in 2016. To avoid such scenarios, we need to understand how confident deep learning models really are in their predictions. Uncertainty Quantification (UQ) serves this purpose and tries to quantify the uncertainty both in the data and in the trained model.
Bayesian statistics mostly distinguishes between two types of uncertainty: aleatoric and epistemic. Aleatoric uncertainty refers to the uncertainty inherent in the data and cannot be reduced by observing more data. For example, if we look at the image below, we can see that the two classes are mixed, and it is hard to infer what the label of a new point in the mixed regions should be. Adding more data will not make the classification easier.
Epistemic uncertainty, on the other hand, is the uncertainty of the model due to a lack of knowledge. For example, in the image above, we see that we don't have enough data points to confidently say which decision boundary is the best one. In contrast to aleatoric uncertainty, adding more data points here does help to acquire additional information and hence reduce the epistemic uncertainty.
In Multimodal Deep Learning, the interactions between the uncertainties of the different modalities can be more complex. The modalities may carry complementary information, which reduces the overall uncertainty, or conflicting information, which can increase it.
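To build intuition, here is a toy illustration (made-up numbers, not LUMA code): averaging the predictions of two agreeing modalities keeps the fused prediction confident, while averaging two conflicting ones yields a near-uniform, high-entropy prediction.
import numpy as np

def entropy(p):
    # Shannon entropy of a categorical distribution
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + 1e-12)).sum())

# two modalities agree: the fused prediction stays confident
agree = (np.array([0.9, 0.1]) + np.array([0.85, 0.15])) / 2
# two modalities conflict: the fused prediction becomes almost uniform
conflict = (np.array([0.9, 0.1]) + np.array([0.1, 0.9])) / 2

print(agree, entropy(agree))        # ~[0.875, 0.125], low entropy
print(conflict, entropy(conflict))  # [0.5, 0.5], maximal entropy for two classes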
In this blog post, we will explore different uncertainty scenarios and measure the corresponding uncertainties on the LUMA multimodal dataset¹.
LUMA Dataset
We are going to use the LUMA dataset, which allows us to inject different types of noise into each of the modalities and observe the changes in uncertainties. The LUMA dataset comprises three modalities: audio, image, and text. The image modality contains small 32x32 images of different objects. The audio modality contains pronunciations of the labels of these objects, and the text modality contains text passages about the objects. In total there are 50 classes, 42 of which are designed for model training and testing, while the other 8 are provided as out-of-distribution data.
First, we need to download and compile the dataset. For that, we need to go to our command line interface (bash in my case), and run the following command, which will clone the LUMA dataset compiler and noise injector:
git clone https://github.com/bezirganyan/LUMA.git
cd LUMA
Then, we need to install the dependencies by creating and activating a conda environment (make sure you have anaconda or miniconda installed):
conda env create -f environment.yml
conda activate luma_env
Having all the dependencies, we can download the dataset into the data directory with:
git lfs install
git clone https://huggingface.co/datasets/bezirganyan/LUMA data
Finally, we can compile different dataset versions with different types and amounts of noise in each modality. To compile the default dataset (i.e., without additional noise), we need to run:
python compile_dataset.py
Now, the LUMA tool allows us to inject different types of noise:
- Sample Noise — This type of noise adds realistic noise to each of the modalities. For example, for the text modality, it can replace words with antonyms, add typos, spelling errors, etc. For the audio modality, it can add background conversations, typing noises, etc. And for the image modality, noises like blur, defocus, frost, etc., can be added.
- Label Noise — This type of noise randomly switches the labels of data samples to their closest classes, which increases the mixing between classes.
- Diversity — This controls how diverse the data points are. If we reduce the diversity, the data points become more concentrated in the latent space, which means the models will have less information to work with.
- Out-of-distribution (OOD) samples — The LUMA dataset also provides OOD samples, i.e., samples that lie outside the training distribution. Ideally, an ML model should have high uncertainty on these kinds of samples, so that it doesn't make a confidently wrong decision on a distribution it hasn't seen before.
Let's inject these noises one at a time. To control the amount of noise, we can modify (or create) a configuration file in the cfg folder. Nevertheless, there are already some pre-configured options available that we will use. For sample noise, we can make use of the pre-defined configuration file cfg/noise_sample.yml. In particular, we can pay attention to these lines in the configuration for each modality:
sample_noise:
  add_noise_train: True
  add_noise_test: True
They turn the sample noise on or off per modality. The lines immediately below them control the noise parameters and are different for each modality. For audio, they look like this:
sample_noise:
  add_noise_train: True
  add_noise_test: True
  noisy_data_ratio: 1
  min_snr: 3
  max_snr: 5
  output_path: data/noisy_audio
where we can control the noisy data ratio (0.0–1.0), minimum and maximum signal-to-noise ratio, and where to save the noisy audio files.
For text, they look like this:
sample_noise:
  add_noise_train: True
  add_noise_test: True
  noisy_data_ratio: 1
  noise_config:
    KeyboardNoise:
      aug_char_min: 1
      aug_char_max: 5
      aug_word_min: 3
      aug_word_max: 8
    BackTranslationNoise:
      device: cuda # cuda or cpu
    ...
Here, you can specify noises from: KeyboardNoise, BackTranslationNoise, SpellingNoise, OCRNoise, RandomCharNoise, RandomWordNoise, AntonymNoise. The parameters for each noise can be found here.
Finally, for image modality, the configuration looks like this:
sample_noise:
  add_noise_train: True
  add_noise_test: True
  noisy_data_ratio: 1
  output_path: data/noisy_images.pth
  noise_config:
    gaussian_noise:
      severity: 4
    shot_noise:
      severity: 4
    impulse_noise:
      severity: 4
You can choose noises from: gaussian_noise, shot_noise, impulse_noise, defocus_blur, frosted_glass_blur, motion_blur, zoom_blur, snow, frost, fog, brightness, contrast, elastic, pixelate, jpeg_compression. For each noise, you can specify a severity parameter, which takes values from 1 to 5. Below you can see examples of the different noise types for images:
Then, we can compile the dataset with sample noise by running:
python compile_dataset.py -c cfg/noise_sample.yml
You can of course use any other configuration files.
To add label noise, one only needs to change the label_switch_prob parameter for each modality. As an example, one can look at cfg/noise_label.yml. Finally, for diversity, one needs to change the compactness parameter. The higher the compactness value, the less diverse the data will be. An example of this can be seen in cfg/noise_diversity.yml.
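Assuming the pre-defined configuration files mentioned above are present in the repository, the corresponding dataset variants can be compiled with the same script as before:
python compile_dataset.py -c cfg/noise_label.yml
python compile_dataset.py -c cfg/noise_diversity.yml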
The OOD data for each generation is saved in a separate file specified in the configuration file.
Loading the Dataset in PyTorch
We can use the LUMADataset class from dataset.py to load the dataset in PyTorch.
from dataset import LUMADataset
train_audio_path = 'data/audio/datalist_train.csv'
train_text_path = 'data/text_data_train.tsv'
train_image_path = 'data/image_data_train.pickle'
train_audio_data_path = 'data/audio'
train_dataset = LUMADataset(train_image_path,
                            train_audio_path,
                            train_audio_data_path,
                            train_text_path)
Nevertheless, this will return raw texts, audio, and images, which may not be very convenient to feed into our models. Hence, we would like to process these samples and convert them to more suitable formats before using them. For audio, we would like to convert the raw waveforms to mel-spectrograms. For that, we will define a transform as:
from torchvision.transforms import Compose
from torchaudio.transforms import MelSpectrogram
import torch
class PadCutToSizeAudioTransform():
    def __init__(self, size):
        self.size = size

    def __call__(self, audio):
        # pad with zeros if the time axis is shorter than `size`, otherwise cut it
        if audio.shape[-1] < self.size:
            audio = torch.nn.functional.pad(audio, (0, self.size - audio.shape[-1]))
        elif audio.shape[-1] > self.size:
            audio = audio[..., :self.size]
        return audio
audio_transform = Compose([MelSpectrogram(), PadCutToSizeAudioTransform(128)])
Here we use the MelSpectrogram transform, and then a custom transform to pad/cut the spectrograms to the same size for all samples.
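As a quick sanity check of the audio pipeline (a minimal sketch with a random waveform; the spectrogram width before padding depends on torchaudio's default MelSpectrogram parameters, so the shapes are only indicative):
import torch

waveform = torch.randn(1, 16000)         # ~1 second of fake mono audio
spectrogram = audio_transform(waveform)  # audio_transform as defined above
print(spectrogram.shape)                 # torch.Size([1, 128, 128]) after padding/cutting the time axis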
For the text data, we choose to use averaged BERT² embeddings for training. To do that, we can extract the text features into a file and then define a custom transform that loads the embeddings instead of the raw text:
import numpy as np
from data_generation.text_processing import extract_deep_text_features

extract_deep_text_features(train_text_path, output_path='text_features_train.npy')

class Text2FeatureTransform():
    def __init__(self, features_path):
        with open(features_path, 'rb') as f:
            self.features = np.load(f)

    def __call__(self, text, idx):
        # return the pre-computed embedding of the sample instead of its raw text
        return self.features[idx]

text_transform = Text2FeatureTransform('text_features_train.npy')
For the image modality, we will normalize the images and convert them to tensors:
from torchvision.transforms import ToTensor, Normalize
image_transform = Compose([
    ToTensor(),
    Normalize(mean=(0.51, 0.49, 0.44),
              std=(0.27, 0.26, 0.28))
])
Finally, we will apply these transforms by passing them to the dataset class:
train_dataset = LUMADataset(train_image_path, train_audio_path, train_audio_data_path, train_text_path,
                            text_transform=text_transform,
                            audio_transform=audio_transform,
                            image_transform=image_transform)
We can load test and OOD data in a similar fashion. The final data loading procedure will be:
import numpy as np
import torch
from torchaudio.transforms import MelSpectrogram
from torchvision.transforms import Compose, Normalize, ToTensor

from data_generation.text_processing import extract_deep_text_features
from dataset import LUMADataset
train_audio_path = 'data/audio/datalist_train.csv'
train_text_path = 'data/text_data_train.tsv'
train_image_path = 'data/image_data_train.pickle'
audio_data_path = 'data/audio'
test_audio_path = 'data/audio/datalist_test.csv'
test_text_path = 'data/text_data_test.tsv'
test_image_path = 'data/image_data_test.pickle'
ood_audio_path = 'data/audio/datalist_ood.csv'
ood_text_path = 'data/text_data_ood.tsv'
ood_image_path = 'data/image_data_ood.pickle'
class PadCutToSizeAudioTransform():
    def __init__(self, size):
        self.size = size

    def __call__(self, audio):
        # pad with zeros if the time axis is shorter than `size`, otherwise cut it
        if audio.shape[-1] < self.size:
            audio = torch.nn.functional.pad(audio, (0, self.size - audio.shape[-1]))
        elif audio.shape[-1] > self.size:
            audio = audio[..., :self.size]
        return audio

class Text2FeatureTransform():
    def __init__(self, features_path):
        with open(features_path, 'rb') as f:
            self.features = np.load(f)

    def __call__(self, text, idx):
        return self.features[idx]
extract_deep_text_features(train_text_path, output_path='text_features_train.npy')
extract_deep_text_features(test_text_path, output_path='text_features_test.npy')
extract_deep_text_features(ood_text_path, output_path='text_features_ood.npy')
image_transform = Compose([
    ToTensor(),
    Normalize(mean=(0.51, 0.49, 0.44),
              std=(0.27, 0.26, 0.28))
])
text_transform_train = Text2FeatureTransform('text_features_train.npy')
text_transform_test = Text2FeatureTransform('text_features_test.npy')
text_transform_ood = Text2FeatureTransform('text_features_ood.npy')
audio_transform = Compose([MelSpectrogram(), PadCutToSizeAudioTransform(128)])
train_dataset = LUMADataset(train_image_path, train_audio_path, audio_data_path, train_text_path,
                            text_transform=text_transform_train,
                            audio_transform=audio_transform,
                            image_transform=image_transform)
test_dataset = LUMADataset(test_image_path, test_audio_path, audio_data_path, test_text_path,
                           text_transform=text_transform_test,
                           audio_transform=audio_transform,
                           image_transform=image_transform)
ood_dataset = LUMADataset(ood_image_path, ood_audio_path, audio_data_path, ood_text_path,
                          text_transform=text_transform_ood,
                          audio_transform=audio_transform,
                          image_transform=image_transform)
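With the three datasets constructed, a quick check of a single sample helps verify the transforms (a minimal sketch; it assumes each item is an (image, audio, text, label) tuple, which is how the training code below unpacks the batches):
image, audio, text, label = train_dataset[0]
print(image.shape)  # expected: torch.Size([3, 32, 32]) normalized image tensor
print(audio.shape)  # expected: torch.Size([1, 128, 128]) mel-spectrogram
print(text.shape)   # expected: (768,) averaged BERT embedding
print(label)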
Building a Multimodal UQ Model
For building the multimodal UQ model, we are going to use a recent multimodal approach based on evidential learning. Evidential deep learning³ is a method that enhances traditional deep learning models by not only making predictions but also providing a measure of uncertainty about those predictions. It leverages principles from Dempster-Shafer theory, a mathematical framework for evidence-based reasoning. This theory allows the model to combine different pieces of evidence to calculate degrees of belief, rather than a single deterministic output. Instead of just giving a single answer, evidential learning outputs a range of possible answers along with the confidence level in each.
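Concretely, an evidential classifier outputs a non-negative evidence vector e over the K classes, which parameterizes a Dirichlet distribution with alpha = e + 1. Following the subjective-logic mapping of Sensoy et al.³, the evidence translates into per-class belief masses and a vacuity (overall uncertainty) term. A minimal sketch with toy numbers, not LUMA code:
import torch

def dirichlet_opinion(evidence):
    # map non-negative evidence to belief masses, vacuity and expected probabilities
    alpha = evidence + 1                # Dirichlet parameters
    S = alpha.sum(-1, keepdim=True)     # Dirichlet strength
    belief = evidence / S               # per-class belief masses
    vacuity = evidence.shape[-1] / S    # u = K / S, the "I don't know" mass
    prob = alpha / S                    # expected class probabilities
    return belief, vacuity, prob

# plenty of evidence for class 0 -> low vacuity
print(dirichlet_opinion(torch.tensor([[20., 1., 1.]])))
# hardly any evidence at all -> vacuity close to 1
print(dirichlet_opinion(torch.tensor([[0.2, 0.1, 0.3]])))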
Following the ideas presented by Xu et al. (2024), we are going to build evidential networks for each modality and combine them using their proposed conflictive opinion aggregation strategy (RCML⁴). The image classifier will hence look like this:
class ImageClassifier(torch.nn.Module):
    def __init__(self, num_classes, dropout=0.3):
        super(ImageClassifier, self).__init__()
        self.image_model = torch.nn.Sequential(  # from batch_size x 3 x 32 x 32 images
            torch.nn.Conv2d(3, 32, 3),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2),
            torch.nn.Dropout(dropout),
            torch.nn.Conv2d(32, 64, 3),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2),
            torch.nn.Dropout(dropout),
            torch.nn.Flatten(),
        )
        self.classifier = torch.nn.Linear(64 * 6 * 6, num_classes)

    def forward(self, x):
        image, audio, text = x
        image = self.image_model(image.float())
        return self.classifier(image)
Similarly, the audio and text classifiers will be:
class AudioClassifier(torch.nn.Module):
    def __init__(self, num_classes, dropout=0.5):
        super(AudioClassifier, self).__init__()
        self.audio_model = torch.nn.Sequential(  # from batch_size x 1 x 128 x 128 spectrogram
            torch.nn.Conv2d(1, 32, 5),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2),
            torch.nn.Dropout(dropout),
            torch.nn.Conv2d(32, 64, 3),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2),
            torch.nn.Dropout(dropout),
            torch.nn.Conv2d(64, 64, 3),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2),
            torch.nn.Dropout(dropout),
            torch.nn.Flatten()
        )
        self.classifier = torch.nn.Linear(64 * 14 * 14, num_classes)

    def forward(self, x):
        image, audio, text = x
        audio = self.audio_model(audio)
        return self.classifier(audio)

class TextClassifier(torch.nn.Module):
    def __init__(self, num_classes, dropout=0.5):
        super(TextClassifier, self).__init__()
        self.text_model = torch.nn.Sequential(
            torch.nn.Linear(768, 512),
            torch.nn.ReLU(),
            torch.nn.Dropout(dropout),
            torch.nn.Linear(512, 256),
            torch.nn.ReLU(),
            torch.nn.Dropout(dropout),
        )
        self.classifier = torch.nn.Linear(256, num_classes)

    def forward(self, x):
        image, audio, text = x
        text = self.text_model(text)
        return self.classifier(text)
Having these uni-modal classifiers, we will combine them into a multimodal network:
class MultimodalClassifier(torch.nn.Module):
    def __init__(self, num_classes, dropout=0.5):
        super(MultimodalClassifier, self).__init__()
        self.image_model = ImageClassifier(num_classes, dropout)
        self.audio_model = AudioClassifier(num_classes, dropout)
        self.text_model = TextClassifier(num_classes, dropout)

    def forward(self, x):
        image_outputs = self.image_model(x)
        audio_outputs = self.audio_model(x)
        text_outputs = self.text_model(x)
        # softplus keeps the per-modality evidence non-negative
        image_logits = torch.nn.functional.softplus(image_outputs)
        audio_logits = torch.nn.functional.softplus(audio_outputs)
        text_logits = torch.nn.functional.softplus(text_outputs)
        logits = [image_logits, audio_logits, text_logits]
        # aggregate the per-modality evidences by progressive averaging
        agg_logits = image_logits
        for i in range(1, 3):
            agg_logits = (agg_logits + logits[i]) / 2
        return agg_logits, (image_logits, audio_logits, text_logits)
Here we use the softplus function, since in evidential networks the evidence values must be non-negative. The diagram of the architecture can be seen in the image below:
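As a quick shape sanity check of the MultimodalClassifier (a minimal sketch with random tensors; the batch size is arbitrary and the input shapes follow the transforms defined earlier):
import torch

model = MultimodalClassifier(num_classes=42)
batch = (
    torch.randn(4, 3, 32, 32),    # images: 32x32 RGB
    torch.randn(4, 1, 128, 128),  # audio: 128x128 mel-spectrograms
    torch.randn(4, 768),          # text: averaged BERT embeddings
)
agg_evidence, per_modality = model(batch)
print(agg_evidence.shape)               # torch.Size([4, 42])
print([e.shape for e in per_modality])  # three tensors of shape [4, 42]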
To make our training easier, we are going to use the PyTorch Lightning framework. For that, we need to define another lightning class:
import numpy as np
import pytorch_lightning as pl
import torch
from torchmetrics import Accuracy
from baselines.utils import AvgTrustedLoss
class DirichletModel(pl.LightningModule):
    def __init__(self, model, num_classes=42, dropout=0.):
        super(DirichletModel, self).__init__()
        self.num_classes = num_classes
        # the MultimodalClassifier defined above only takes num_classes and dropout
        self.model = model(num_classes=num_classes, dropout=dropout)
        self.train_acc = Accuracy(task='multiclass', num_classes=num_classes)
        self.val_acc = Accuracy(task='multiclass', num_classes=num_classes)
        self.test_acc = Accuracy(task='multiclass', num_classes=num_classes)
        self.criterion = AvgTrustedLoss(num_views=3)
        self.aleatoric_uncertainties = None
        self.epistemic_uncertainties = None

    def forward(self, inputs):
        return self.model(inputs)

    def training_step(self, batch, batch_idx):
        loss, output, target = self.shared_step(batch)
        self.log('train_loss', loss)
        acc = self.train_acc(output, target)
        self.log('train_acc_step', acc, prog_bar=True)
        return loss

    def shared_step(self, batch):
        image, audio, text, target = batch
        output_a, output = self((image, audio, text))
        output = torch.stack(output)
        loss = self.criterion(output, target, output_a)
        return loss, output_a, target

    def validation_step(self, batch, batch_idx):
        loss, output, target = self.shared_step(batch)
        self.val_acc(output, target)
        alphas = output + 1
        probs = alphas / alphas.sum(dim=-1, keepdim=True)
        # vacuity u = K / S of the aggregated opinion, used as the epistemic uncertainty
        entropy = self.num_classes / alphas.sum(dim=-1)
        alpha_0 = alphas.sum(dim=-1, keepdim=True)
        # expected entropy of the Dirichlet, used as the aleatoric uncertainty
        aleatoric_uncertainty = -torch.sum(probs * (torch.digamma(alphas + 1) - torch.digamma(alpha_0 + 1)), dim=-1)
        return loss, output, target, entropy, aleatoric_uncertainty

    def test_step(self, batch, batch_idx):
        loss, output, target = self.shared_step(batch)
        self.test_acc(output, target)
        alphas = output + 1
        probs = alphas / alphas.sum(dim=-1, keepdim=True)
        entropy = self.num_classes / alphas.sum(dim=-1)
        alpha_0 = alphas.sum(dim=-1, keepdim=True)
        aleatoric_uncertainty = -torch.sum(probs * (torch.digamma(alphas + 1) - torch.digamma(alpha_0 + 1)), dim=-1)
        return loss, output, target, entropy, aleatoric_uncertainty

    def training_epoch_end(self, outputs):
        self.log('train_acc', self.train_acc.compute(), prog_bar=True)
        self.criterion.annealing_step += 1

    def validation_epoch_end(self, outputs):
        self.log('val_acc', self.val_acc.compute(), prog_bar=True)
        self.log('val_loss', np.mean([x[0].detach().cpu().numpy() for x in outputs]), prog_bar=True)
        self.log('val_entropy', torch.cat([x[3] for x in outputs]).mean(), prog_bar=True)
        self.log('val_sigma', torch.cat([x[4] for x in outputs]).mean(), prog_bar=True)

    def test_epoch_end(self, outputs):
        self.log('test_acc', self.test_acc.compute(), prog_bar=True)
        self.log('test_entropy_epi', torch.cat([x[3] for x in outputs]).mean())
        self.log('test_ale', torch.cat([x[4] for x in outputs]).mean())
        self.aleatoric_uncertainties = torch.cat([x[4] for x in outputs]).detach().cpu().numpy()
        self.epistemic_uncertainties = torch.cat([x[3] for x in outputs]).detach().cpu().numpy()

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-2)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.33, patience=5,
                                                               verbose=True)
        return {
            'optimizer': optimizer,
            'lr_scheduler': scheduler,
            'monitor': 'val_loss'
        }
Here, besides predicting the class, we also compute the aleatoric uncertainty (the expected entropy of the predicted Dirichlet distribution) and the epistemic uncertainty (the vacuity K/S of the aggregated opinion).
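To make the two quantities concrete, here is a small numerical sketch that mirrors the computation in test_step for a single sample (toy evidence values, not real model outputs):
import torch

evidence = torch.tensor([[12.0, 2.0, 1.0]])          # toy aggregated evidence over 3 classes
alphas = evidence + 1
alpha_0 = alphas.sum(dim=-1, keepdim=True)
probs = alphas / alpha_0                             # expected class probabilities

epistemic = evidence.shape[-1] / alphas.sum(dim=-1)  # vacuity K / S
aleatoric = -torch.sum(probs * (torch.digamma(alphas + 1) - torch.digamma(alpha_0 + 1)), dim=-1)

print(epistemic, aleatoric)  # low vacuity and moderate expected entropy for this confident sample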
Training the Multimodal Model
For training, we just need to define the data loaders and use the PyTorch Lightning Trainer class.
batch_size = 128
classes = 42
dropout_p = 0.3
train_dataset, val_dataset = torch.utils.data.random_split(
    train_dataset,
    [int(0.8 * len(train_dataset)), len(train_dataset) - int(0.8 * len(train_dataset))])

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=8)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=8)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=8)
ood_loader = torch.utils.data.DataLoader(ood_dataset, batch_size=batch_size, shuffle=False, num_workers=8)
# Now we can use the loaders to train a model
model = DirichletModel(MultimodalClassifier, classes, dropout=dropout_p)
trainer = pl.Trainer(max_epochs=300,
                     gpus=1 if torch.cuda.is_available() else 0,
                     callbacks=[pl.callbacks.EarlyStopping(monitor='val_loss', patience=10, mode='min'),
                                pl.callbacks.ModelCheckpoint(monitor='val_loss', mode='min', save_last=True)])
trainer.fit(model, train_loader, val_loader)

print('Testing model')
trainer.test(model, test_loader)
print('Test results:')
print(trainer.callback_metrics)
aleatoric_uncertainties = model.aleatoric_uncertainties
epistemic_uncertainties = model.epistemic_uncertainties

print('Testing OOD')
trainer.test(model, ood_loader)
aleatoric_uncertainties_ood = model.aleatoric_uncertainties
epistemic_uncertainties_ood = model.epistemic_uncertainties

# OOD detection: can the epistemic uncertainty separate in-distribution from OOD samples?
from sklearn.metrics import roc_auc_score

auc_score = roc_auc_score(
    np.concatenate([np.zeros(len(epistemic_uncertainties)), np.ones(len(epistemic_uncertainties_ood))]),
    np.concatenate([epistemic_uncertainties, epistemic_uncertainties_ood]))
print(f'AUC score: {auc_score}')
Here we are logging the classification accuracy, the average uncertainty values and the AUC score for OOD detection.
For training on the noisy versions of the datasets, we just need to change the data paths to noisy data paths.
Training Results
On the clean data (without injecting additional noise), we get the following results:
As we can see, adding noise effectively raises the uncertainty metrics. An interesting research direction, hence, is to adjust the noise levels and see how the uncertainties change. It is essential not only to build DL models that are robust to these noises, but also to find UQ methods that can reliably indicate when the models are unsure about their predictions.
Acknowledgements
This blog post is written based on the code and dataset of LUMA, published within the scope of my PhD thesis at Aix-Marseille University (AMU), CNRS, LIS. I would like to mention and thank my PhD Supervisors and paper co-authors Sana Sellami (AMU, CNRS, LIS), Laure Berti-Équille (IRD, ESPACE-DEV), and Sébastien Fournier (AMU, CNRS, LIS).
If you liked this post, please star LUMA on GitHub. We will be happy to hear your thoughts, questions, or suggestions in the discussion below.
[1] Bezirganyan, G., Sellami, S., Berti-Équille, L., & Fournier, S. (2024). LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data. arXiv:2406.09864. http://arxiv.org/abs/2406.09864
[2] Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Chapter of the Association for Computational Linguistics.
[3] Sensoy, M., Kandemir, M., & Kaplan, L.M. (2018). Evidential Deep Learning to Quantify Classification Uncertainty. ArXiv, abs/1806.01768.
[4] Xu, C., Si, J., Guan, Z., Zhao, W., Wu, Y., & Gao, X. (2024). Reliable Conflictive Multi-View Learning. AAAI Conference on Artificial Intelligence.