Masked language modelling with multimodal transformers

How can masked language modelling (MLM) be applied to a combined text and image input using a multimodal model such as VisualBERT or CLIP? For example, given an image and some text in which one word is masked, how can MLM be used to predict the masked word as "cat"? Is it also possible to give the model only the text, without the image? How can this be implemented with the Hugging Face library API to get MLM predictions? A code snippet demonstrating this would be great. Any help towards a better understanding would be appreciated.