Francesco Tisiot

Posted on • Originally published at aiven.io

Image recognition with Python, OpenCV, OpenAI CLIP and pgvector

In the era of AI, everything is a vector: from huge texts parsed and categorized by Large Language Models (LLMs) to images decomposed to find specific objects within them.

When asking questions of these models, the answer is defined by proximity: the set of stored vectors is scanned to find the closest one (or set of them) in terms of distance, angle or a similar metric.

If the entire vectorised dataset can be hosted in memory, no problem; but what happens when the data gets big? This is where tools aimed at storing huge datasets can help, even better if they expose the search functionality in a known language (SQL) and without the need to extract the entire dataset each time. In our case the tool is PostgreSQL, and the vector functionality is provided by the pgvector extension, newly available in Aiven for PostgreSQL.

We'll recreate a familiar use case: you're at an event, and a friend or photographer takes a lot of pictures which are then shared with all the participants. How do you identify all the pictures that include you, without having to browse them all? We recently had our yearly face-to-face meeting at Aiven, called crabweek, so I had the perfect dataset to start playing with vector representation and search.

Vector representation, embeddings and search

Information can be stored in several ways. Think about the sentence I Love Parks: you could represent it in a table with three columns flagging the presence or absence of each word (I, LOVE and PARKS), as in the image below:

Table containing three columns named I, LOVE and PARKS with value 1

This is a lossless method: no information (apart from the order of the words) is lost with this encoding. The drawback, though, is that the number of columns grows with the number of distinct words across the sentences. For example, if we also try to encode Love Croissants with the same structure, we end up with four columns: I, LOVE, PARKS and CROISSANTS, as shown below.

Table containing four columns named I, LOVE, PARKS and CROISSANTS with value 0 or 1 depending on the presence of the word in the phrase

Embeddings

What are embeddings then? As mentioned above, storing the presence of each word in a separate column creates a very wide and unmanageable dataset. A standard approach is therefore to reduce the dimensionality by aggregating or dropping some of the redundant or less distinguishable information. In our previous example, we could still encode the same information by:

  • dropping the I column since it doesn't add any value (it's always 1)
  • dropping the CROISSANTS column, since we can still distinguish the two sentences by the presence of the PARKS word.

If we visualize the two sentences above in a graph using only the LOVE and PARKS axes (therefore excluding I and CROISSANTS), the result shows that I Love Parks is encoded as (1,1) since it contains both the LOVE and PARKS words. On the other hand, I Love Croissants is encoded as (1,0) since it includes LOVE but not PARKS.

Graph showing the phrases Love Parks and Love Croissants being encoded in the axes LOVE and PARKS

In the graph above, distance provides a measure of similarity between two vectors: the more two vectors point in the same direction or sit close to each other, the more similar the information they represent should be.
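To make this concrete, here's a minimal sketch (plain Python, using only the standard math module) that compares the two encodings with both kinds of metric:

import math

love_parks = (1, 1)       # "I Love Parks" on the (LOVE, PARKS) axes
love_croissants = (1, 0)  # "I Love Croissants" on the same axes

# Euclidean distance: the smaller, the more similar
print(math.dist(love_parks, love_croissants))  # 1.0

# cosine similarity: the closer to 1, the more the vectors point the same way
dot = sum(a * b for a, b in zip(love_parks, love_croissants))
print(dot / (math.hypot(*love_parks) * math.hypot(*love_croissants)))  # ~0.707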

Does this work with pictures?

A similar approach also works for pictures. As beautifully explained by Mathias Grønne and visualized in the image below (taken from that blog), an image is just a matrix of numbers, and therefore we can reduce the matrix information and create embeddings from it.

Image encoding

Set up face recognition with Python and PostgreSQL pgvector

If you, like me, use the Photos app on Mac, you'll be familiar with the "People" tab, where you can select a person and find the photos that include them. I tried to recreate the same setup with the pictures coming from crabweek; you're invited to run the code below, with adaptations, on top of any folder containing images.

Since images are sensitive data, we don't want to rely on any online service or upload them to the internet. The entire pipeline defined below runs 100% locally.

The data pipeline will involve several steps:

  • Download all the pictures in a local folder
  • Retrieve the faces included in any picture
  • Calculate the embeddings from the faces
  • Store the embeddings in PostgreSQL in a pgvector vector column
  • Get a colleague's picture from Slack
  • Identify the face in the picture (needed since people can have all types of pictures in Slack)
  • Calculate the embeddings in the Slack picture
  • Use the pgvector distance function to retrieve the closest faces, and therefore photos

The entire flow is shown in the picture below:

Entire pipeline for Face recognitions

Retrieve the faces from photos

An ideal dataset for calculating embeddings would contain only pictures of one person at a time, looking straight into the camera with minimal background. As we know, this is not the case for event pictures, where a multitude of people is commonly grouped together against various backgrounds. Therefore, to build a process able to find a person included in a picture, we need to isolate the faces of the people within the photos and create the embeddings from the faces rather than from the entire photos.

Faces being extracted from the picture

To "extract" faces from the pictures we used Python, OpenCV a computer vision tool and a pre-trained Haar Cascade model, the description of the process can be found in this article.

To get it working, we just need to install the opencv-python package with:

pip install opencv-python

Download the haarcascade_frontalface_default.xml pre-trained Haar Cascade model from the OpenCV GitHub repository and store it locally.
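If you prefer the terminal, the model file can be downloaded, for example, with curl (the URL below assumes the 4.x branch layout of the opencv/opencv repository):

curl -O https://raw.githubusercontent.com/opencv/opencv/4.x/data/haarcascades/haarcascade_frontalface_default.xml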

Insert the code below in a Python file, replacing <INSERT YOUR IMAGE NAME HERE> with the path to the image you want to identify faces in, and <INSERT YOUR TARGET IMAGE NAME HERE> with the name of the file where you want to store the face.

# importing the cv2 library
import cv2

# loading the pre-trained Haar Cascade algorithm file into the alg variable
alg = "haarcascade_frontalface_default.xml"
# passing the algorithm to OpenCV
haar_cascade = cv2.CascadeClassifier(alg)
# loading the image path into the file_name variable
file_name = '<INSERT YOUR IMAGE NAME HERE>'
# reading the image
img = cv2.imread(file_name)
# creating a grayscale version of the image, as expected by the detector
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# detecting the faces
faces = haar_cascade.detectMultiScale(gray_img, scaleFactor=1.05, minNeighbors=2, minSize=(100, 100))

# for each face detected
for x, y, w, h in faces:
    # crop the image to select only the face
    cropped_image = img[y : y + h, x : x + w]
    # loading the target image path into the target_file_name variable
    target_file_name = '<INSERT YOUR TARGET IMAGE NAME HERE>'
    # writing the cropped face into the target file
    cv2.imwrite(target_file_name, cropped_image)

The line that performs the magic is:

faces = haar_cascade.detectMultiScale(gray_img, scaleFactor=1.05, minNeighbors=2, minSize=(100, 100))

Where:

  • gray_img is the source image in which to find faces
  • scaleFactor is the factor by which the image is scaled down at each detection pass; values closer to 1 are slower but more accurate
  • minNeighbors is the number of neighbouring detections a candidate region needs in order to be kept as a face; lower values find more faces but also more duplicates and false positives
  • minSize is the minimum size of a detected face, in this case a square of 100×100 pixels

The for loop iterates over all the detected faces and stores each one in a file; since the example uses a fixed target_file_name, you'll want to build a unique name per face (for example using the x and y coordinates) so the crops don't overwrite each other. Moreover, if you plan to calculate embeddings over a series of pictures, you'll want to wrap the code above in a loop parsing all the files in a specific folder, as sketched below.
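A minimal sketch of such a loop could look like the following, where the photos and faces folder names are placeholders to adapt:

import os
import cv2

haar_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
source_dir = "photos"  # placeholder: folder containing the event pictures
target_dir = "faces"   # placeholder: folder for the cropped faces
os.makedirs(target_dir, exist_ok=True)

for file_name in os.listdir(source_dir):
    img = cv2.imread(os.path.join(source_dir, file_name))
    if img is None:
        # skip files that are not images
        continue
    gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = haar_cascade.detectMultiScale(gray_img, scaleFactor=1.05, minNeighbors=2, minSize=(100, 100))
    for x, y, w, h in faces:
        # use the source file name plus the face coordinates to keep names unique
        target_file_name = os.path.join(target_dir, f"{file_name}-{x}-{y}.jpg")
        cv2.imwrite(target_file_name, img[y : y + h, x : x + w])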

The result of the face detection stage is not perfect: it identifies three of the four visible faces, but it's good enough for our purpose. You can fine-tune the algorithm parameters to find the best fit for your use case.

Calculate the embeddings

Once the faces are identified, we can calculate the embeddings. For this step we are going to use imgbeddings, a Python package that generates embedding vectors from images using OpenAI's CLIP model via Hugging Face transformers.

To calculate the embeddings of a picture, we need to first install the required packages via

pip install imgbeddings
pip install pillow 

And then include the following in a Python file

# importing the required libraries
from imgbeddings import imgbeddings
from PIL import Image

# loading the face image path into file_name variable
file_name = '<INSERT YOUR FACE FILE NAME HERE>'
# opening the image
img = Image.open(file_name)
# loading the `imgbeddings`
ibed = imgbeddings()
# calculating the embeddings
embedding = ibed.to_embeddings(img)

The last line calculates the embeddings: the result is a numpy array containing the 768 numbers that represent the image.
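As a quick sanity check (the shape below assumes, as the imgbeddings documentation suggests, one row per image):

# verifying the shape of the embedding computed above
print(type(embedding))   # <class 'numpy.ndarray'>
print(embedding.shape)   # one row of 768 values per image, i.e. (1, 768)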

Store embeddings in PostgreSQL using pgvector

It's time to start using the capabilities of PostgreSQL and the pgvector extension. First of all we need a PostgreSQL service up and running: navigate to the Aiven Console, create a new PostgreSQL service selecting your favourite cloud provider, region and plan, and enable extra disk storage if needed. The pgvector extension is available in all plans. Once all the settings are defined, click Create Service.

Once the service is up and running (it can take a couple of minutes), navigate to the service Overview and copy the Service URI parameter. We'll use it to connect to PostgreSQL via psql with:

psql <SERVICE_URI>

Once connected, we can enable the pgvector extension with:

CREATE EXTENSION vector;

And now we can create a table containing the picture name and the embeddings with:

CREATE TABLE pictures (picture text PRIMARY KEY, embedding vector(768));

Note the embedding vector(768) definition: we are creating a vector of 768 dimensions, exactly the same dimensionality as the output of the ibed.to_embeddings(img) call in the previous step.
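As a side note, once the table grows, pgvector can also speed up similarity searches with an approximate index. A minimal sketch, using the L2 operator class that matches the <-> search we'll run later (the lists value is just a starting point to tune):

CREATE INDEX ON pictures USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);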

To load the embeddings into PostgreSQL we can use psycopg2, installing it with:

pip install psycopg2

and then running the following Python code, replacing <SERVICE_URI> with the Service URI copied above:

# importing the required library
import psycopg2

conn = psycopg2.connect('<SERVICE_URI>')
cur = conn.cursor()
# storing the picture name and its embedding
# (embedding[0] selects the only row of the (1, 768) array)
cur.execute('INSERT INTO pictures values (%s,%s)', (file_name, embedding[0].tolist()))
conn.commit()
conn.close()

Where file_name and embedding are the variables from the previous Python steps. To load a whole folder of faces, you can wrap the same statement in a loop, as sketched below.
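A minimal sketch, assuming the cropped faces are stored in a faces folder as in the earlier example:

import os
import psycopg2
from imgbeddings import imgbeddings
from PIL import Image

conn = psycopg2.connect('<SERVICE_URI>')
cur = conn.cursor()
ibed = imgbeddings()

# calculating and storing the embeddings for every cropped face
for file_name in os.listdir("faces"):
    img = Image.open(os.path.join("faces", file_name))
    embedding = ibed.to_embeddings(img)
    cur.execute('INSERT INTO pictures values (%s,%s)', (file_name, embedding[0].tolist()))

conn.commit()
conn.close()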

Get Slack image, retrieve face and calculate embeddings

The following steps in the process are similar to the ones already performed; this time the source image is the Slack profile picture, in which we'll detect the face and calculate the embeddings. The code above can be reused by changing the location of the source image.

Calculate embeddings from Slack picture

The code below can give you a starting point:

# reusing the libraries and the Haar Cascade model from the previous steps
import cv2
from imgbeddings import imgbeddings
from PIL import Image

haar_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
# loading the image path into file_name variable
file_name = '<INSERT YOUR SLACK IMAGE NAME HERE>'
# reading the image
img = cv2.imread(file_name)
# creating a grayscale version of the image
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# detecting the faces
faces = haar_cascade.detectMultiScale(gray_img, scaleFactor=1.05, minNeighbors=2, minSize=(100, 100))

# for each face detected in the Slack picture
for x, y, w, h in faces:
    # crop the image to select only the face
    cropped_image = img[y : y + h, x : x + w]
    # converting the OpenCV (BGR) array into a PIL image for imgbeddings
    pil_image = Image.fromarray(cv2.cvtColor(cropped_image, cv2.COLOR_BGR2RGB))
    # calculating the embeddings
    ibed = imgbeddings()
    slack_img_embedding = ibed.to_embeddings(pil_image)

Since Slack pictures can be complex, the code above has a for loop iterating over all the detected faces. You might want to add additional checks to find the most relevant face to calculate the embeddings from; one option is sketched below.
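For example, under the assumption that the profile's subject is the biggest face in the picture, you could keep only the largest detection:

# picking the largest detected face as the most likely profile subject
x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
cropped_image = img[y : y + h, x : x + w]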

Find similar images with vector search

The final piece of the puzzle is to use the similarity functions available in pgvector to find the pictures that include the person. pgvector provides different distance operators, depending on the type of search we are trying to perform: <-> for Euclidean distance, <#> for negative inner product and <=> for cosine distance.

We'll use the Euclidean distance operator <-> for our search. To find the pictures with the closest distance we can use the following query in Python:

conn = psycopg2.connect('<SERVICE_URI>')

cur = conn.cursor()
# building the pgvector string representation of the embedding, e.g. '[0.38,-1.15,...]'
string_representation = "[" + ",".join(str(x) for x in slack_img_embedding[0].tolist()) + "]"
cur.execute("SELECT picture FROM pictures ORDER BY embedding <-> %s LIMIT 5;", (string_representation,))
rows = cur.fetchall()
for row in rows:
    print(row)
conn.close()

Where slack_img_embedding is the embedding vector calculated from the Slack profile picture in the previous step. If everything works correctly, you'll see the names of the top 5 pictures closest to the Slack profile image.

The result, in the crabweek case, was five photos that included my colleague Tibs!

Pictures of Tibs

pgvector, enabling Machine Learning in PostgreSQL

Machine Learning is becoming pervasive in day-to-day activities. Being able to store, query and analyse embeddings in the same technology where the data resides, like a PostgreSQL database, could provide a number of benefits in democratising machine learning and enable new use cases achievable with a standard SQL query.

To know more about pgvector and Machine Learning in PostgreSQL:

Your breakdown of AI and vector representations' impact on data analysis is incredible! You've detailed a process, from encoding to face recognition, using PostgreSQL's pgvector extension, showcasing practical uses. Your accessible explanations and included code snippets make this complex topic understandable for many, highlighting how PostgreSQL integrates seamlessly for machine learning. I Definately want to use opencv for machine learing to this level rather than opencv template matching. Kudos on inspiring others (and me) in this innovative field!