If you've been paying attention for the past two years, you've likely noticed that we are in the midst of a quiet revolution in the Artificial Intelligence (AI) industry, driven by the advancement of large foundation models.
If you're in the AI field, you've probably heard of names like GPT, BERT, and DALL-E. Today, I want to introduce you to one more that is joining the list: Microsoft Azure's Cognitive Services for Vision is launching its first large foundation model, which they've named "Florence".
What is Florence?
Florence is a truly astonishing model. Trained on a massive amount of data (billions of text-image pairs), it stands out for its multimodal capability: it combines language and vision abilities, enabling incredible things in vision AI, such as retrieving images from text and generating detailed descriptions.
Before, if you wanted to train a vision model, you needed a specific dataset for each task. For example, if you wanted to train a model for object detection, you needed to label the data for that task and train a specific model for it.
Florence changes the game
With Florence, you can train a large model with a broad dataset and then adapt it to individual tasks using what are known as "adaptation models". These adaptation models are fine-tuned with additional data for each specific task, opening up a host of possibilities, from image classification and retrieval, to object detection, segmentation, and captioning.
Florence's training is also impressive. It learns not only from image labels but from pairs of images and text, allowing for deeper, richer learning. These image-text pairs are processed through contrastive learning, a form of self-supervised learning.
The end result is a model that can do things that previously seemed like science fiction. For example, Florence can perform something called "image retrieval from text". If you ask it to "show me all the red cars", Florence knows what a red car looks like and can show you relevant images.
This advancement is a big step towards complete multimodality, where even more modalities can be integrated. We already have text and images; why not add video next? In the future, these types of models could change the way we interact with images and text in the digital world.
The Florence Project is a clear demonstration of the impressive progress we are seeing in the field of AI. With its ability to understand and process both text and images, Florence is opening up a new universe of possibilities for the future of Artificial Intelligence. And honestly, it's extremely exciting.
Now, I invite you to visit the Vision Studio portal at portal.vision.cognitive.azure.com, where you can try out all the vision capabilities with a no-code experience. You can log in with your Azure account, but you can also test it publicly without logging into Azure. The features are organized by type, for example optical character recognition, spatial analysis, faces, and image analysis, and in the Featured tab you'll find the most recent additions.
Now let's talk about some features.
Add captions to images
Let's start with Florence's ability to add captions to images. This function was previously available in Azure's Cognitive Service for Vision, which achieved human parity in image captioning in 2020. Now, thanks to the large foundation model, this ability has improved significantly.
As an example, I decided to try with an image of mine.
With version 3.2 of the API, which predates the foundation model, this image gave me the result: "a man standing in front of a group of people in white clothing".
However, with the new update, the caption changed to "a man standing in front of a group of white storm troopers".
This is thanks to the open world recognition offered by Florence.
Open world recognition means that the model is capable of zero-shot recognition, as it has been trained with a large amount of data that allows it to recognize millions of object categories anywhere, from species and landmarks to logos, products, celebrities, and much more...
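To make this concrete, here's a minimal Python sketch of requesting a caption over REST. The endpoint and key are placeholders, and the API version, `features` parameter, and `captionResult` response field are assumptions based on the Image Analysis 4.0 preview; check the official documentation for your resource before relying on them.

```python
import json
import urllib.request

# Hypothetical placeholders for your own Azure resource.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<your-key>"

def caption_image(image_url: str) -> str:
    """POST an image URL to the captioning feature and return the caption text."""
    url = (f"{ENDPOINT}/computervision/imageanalysis:analyze"
           "?api-version=2023-02-01-preview&features=caption")
    req = urllib.request.Request(
        url,
        data=json.dumps({"url": image_url}).encode(),
        method="POST",
        headers={"Ocp-Apim-Subscription-Key": KEY,
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_caption(json.load(resp))

def parse_caption(payload: dict) -> str:
    # In the preview response, the caption text sits under "captionResult".
    return payload["captionResult"]["text"]
```

With valid credentials, calling `caption_image` on my photo would return a string like the "storm troopers" caption shown above.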
We've seen how this cognitive service generates a caption describing the content of an image by interpreting its context. Now I'll show you another feature of the service: adding dense captions to images.
Add Dense captions to images
When I upload an image, the service provides not only a complete description of the whole image but also a description for each detected region, up to ten in total, along with the bounding box (the rectangle that surrounds an object) for each region.
What's great about this is that it doesn't just detect objects; it also describes actions, like a boy kicking a soccer ball.
These capabilities are exposed through the Image Analysis APIs.
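A short sketch of what consuming a dense-captions response might look like. The JSON shape here (a `denseCaptionsResult` object with a `values` array of captions and bounding boxes) is an assumption based on the Image Analysis 4.0 preview, and the sample payload is invented for illustration.

```python
def extract_dense_captions(payload: dict) -> list:
    """Return (caption text, bounding box) pairs, one per detected region."""
    return [(item["text"], item["boundingBox"])
            for item in payload["denseCaptionsResult"]["values"]]

# Invented sample response, roughly mirroring the preview API's shape.
sample = {
    "denseCaptionsResult": {
        "values": [
            {"text": "a boy kicking a soccer ball",
             "boundingBox": {"x": 120, "y": 60, "w": 200, "h": 300}},
            {"text": "a soccer ball on the grass",
             "boundingBox": {"x": 180, "y": 310, "w": 60, "h": 60}},
        ]
    }
}

for caption, box in extract_dense_captions(sample):
    print(caption, box)
```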
I have created a console application in .NET using C# to demonstrate the powerful capabilities of the Florence model. This app gives you hands-on experience with Azure's Cognitive Services for Vision. You can explore different features, such as adding captions to images and dense captions.
For those interested in testing this service using the console application, please visit my GitHub repository: Azure-ComputerVision-ImageAnalysis. There, you'll find the necessary code and instructions to get started.
Search Photos with Image Retrieval
Lastly, I want to show you one of my favorite capabilities: Search Photos with Image Retrieval. Here we have sets of images that you can search with natural language, even if you haven't logged in with your Azure account.
Once you've logged in, you have the option to try it with your own images by creating a custom collection. As an example, I used some photos from my office, featuring interactions with various objects. This model, owing to its extensive training, is capable of recognizing and reasoning about a wide range of objects and scenes. Even without explicit labels, the model can identify and locate elements of interest within the images.
The Vision Studio portal allows you to upload arbitrary photos and search them without any extra effort. The model takes care of everything: extracting vectors from your images, whether they're in the cloud or on your local drive, processing text queries, and computing the cosine similarity between the text and image vectors. This similarity determines the relevance of the search results.
Enhancing Image Recognition and Retrieval with Cosine Similarity in Transformer-Based Models
To provide a better understanding of how the model manages searches and determines relevance, it's worth discussing how it uses Cosine Similarity and the concept of vectors or 'embeddings'.
The large foundation model this system is built upon is a type of Transformer. Until now, the state of the art in computer vision relied on convolutional networks; large foundation models, however, are all Transformer-based, which brings with it the concept of embedding vectors.
In terms of the API, vectors are arrays of floating-point values: real numbers representing points in a high-dimensional space. Visualizations typically project this down to three dimensions, but the actual embedding space of the large foundation model can have up to 1,000 dimensions.
Each dimension corresponds to an attribute of the content, such as semantic context, syntactic role, or the context in which it commonly appears. The vector space quantifies semantic similarity between vectors through measures like cosine similarity, computed from the angle between two vectors: the smaller the angle, the more similar the vectors.
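Cosine similarity itself is just a few lines of arithmetic: the dot product of two vectors divided by the product of their lengths. A minimal Python implementation:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction,
    0.0 means orthogonal (unrelated), -1.0 means opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Identical vectors score 1.0, and perpendicular vectors (nothing in common) score 0.0, which is exactly the ordering the search results follow.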
The demonstration you've seen earlier in Vision Studio uses Cosine Similarity to find the similarity between text vectors and image vectors. This underpins the search functionality and determines the relevance of the results.
Image Retrieval APIs
Now, let's take a look at the Image Retrieval APIs, which extract vectors from images and text.
We have an image-vectorization operation and a text-vectorization operation. The set of images you want to explore is processed through the vectorize-image API to extract their vectors, which you then store in a search index alongside the images. This is the index you'll query later, whenever a user submits a text search.
The text query is passed to the vectorize-text operation to extract its vector. Remember that, because the model is grounded in both language and vision, text and image vectors share the same high-dimensional space. The second step is to measure similarity, typically with cosine similarity, although other measures such as Euclidean distance also exist. Next, you select the top image vectors, those closest to the text vector.
Finally, the last step is to retrieve the images corresponding to the image vectors.
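The whole pipeline can be sketched end to end with toy data. Here, tiny three-dimensional vectors stand in for the roughly 1,000-dimensional embeddings the real vectorize-image and vectorize-text APIs would return; everything else (index, ranking, top-k selection) works the same way.

```python
import math

def cosine_similarity(u, v):
    """Dot product over the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def top_k_images(index, query_vec, k=3):
    """Rank indexed image vectors against a query vector, most similar first."""
    ranked = sorted(index.items(),
                    key=lambda kv: cosine_similarity(kv[1], query_vec),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Toy "search index": filenames mapped to pretend image embeddings.
index = {
    "red_car.jpg":     [0.9, 0.1, 0.0],
    "blue_car.jpg":    [0.7, 0.6, 0.1],
    "office_desk.jpg": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # pretend embedding of the text "a red car"

print(top_k_images(index, query, k=2))  # → ['red_car.jpg', 'blue_car.jpg']
```

The final step, retrieving the actual images, is just a lookup of the returned names in whatever storage holds them.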
To better illustrate this concept, I'm sharing this GitHub code that demonstrates the functionality in action: Azure-ComputerVision-ImageRetrieval.
Additionally, I highly recommend this repository by Serge Retkowsky. In there, you'll find fantastic examples and demonstrations to kick-start your learning journey in the world of Azure Cognitive Services for Vision.
I hope this explanation has been helpful! Feel free to leave your comments and questions.
👋Until next time, community