
Programming Snapchat-Like Filters

Vincent ・ 13 min read

Since you probably own a smartphone these days, you may have noticed someone you know popping up on social networks wearing a flower crown or some dog ears on their head. This effect is produced by the Snapchat app, it is called a filter, and every competitor of the app is copying it nowadays.

In this post, we'll try to dissect these filters and answer some questions about them: How are they actually made? Which software libraries are needed to mimic their behavior? Finally, we'll implement some famous filters using Python, or any other language that supports HTTP requests, with the help of the PixLab API.

With no further ado, let's dig a little bit into this Snapchat-like output (in fact, it was made by the program we'll implement below):

snap filters

To produce such an effect, two phases are needed: Analysis & Processing.

Computer Vision to the rescue

The analysis phase always comes first and is the most complicated. It requires some computer vision algorithms running under the hood to perform the following tasks:

Face detection

Given an input image or video frame, find all present human faces and output their bounding boxes (i.e. the rectangle coordinates in the form: X, Y, width & height).

Face detection has been considered a solved problem since the early 2000s, but it still faces some challenges, including detecting tiny, partial & non-frontal faces. The most widely used technique is a combination of Histogram of Oriented Gradients (HOG for short) and Support Vector Machine (SVM). It achieves mediocre to relatively good detection rates given a good quality image, but the method is not capable of real-time detection, at least not on the CPU.

Here is how the HOG/SVM detector works:

Given an input image, compute its pyramidal representation, which is a pyramid of multi-scaled (possibly Gaussian-smoothed) downscaled versions of the original image. For each level of the pyramid, a sliding window approach is used. The sliding window concept is quite simple: by looping over the image with a constant step size, small image patches, typically of size 64 x 128 pixels, are extracted at different scales. For each patch, the algorithm decides whether it contains a face or not: the HOG descriptor is computed for the current window and passed to the SVM classifier (linear or not) for the decision to take place (i.e. face or not). When done with the pyramid, a non-maximum suppression (NMS for short) pass usually takes place in order to discard stacked rectangles. You can read more about the HOG/SVM combination here.
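To make this pipeline concrete, here is a minimal sketch of a HOG/SVM sliding-window detector built with scikit-image and scikit-learn. It is an illustration only: the 64 x 128 window, the stride, the decision threshold and the pre-trained classifier loaded from a hypothetical face_svm.pkl file are all assumptions, not the exact detector used by Snapchat or PixLab.

```python
# Minimal HOG + linear SVM sliding-window detector sketch (illustrative only).
import joblib
from skimage import io, color, transform
from skimage.feature import hog

WIN_W, WIN_H, STEP = 64, 128, 16              # window size & stride
svm = joblib.load("face_svm.pkl")             # hypothetical pre-trained LinearSVC

def hog_features(patch):
    # Classic HOG parameters: 9 orientations, 8x8 cells, 2x2 blocks.
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def detect(image):
    gray = color.rgb2gray(image)
    boxes = []
    # Pyramidal representation: progressively downscaled copies of the image.
    for level in transform.pyramid_gaussian(gray, downscale=1.25):
        if level.shape[0] < WIN_H or level.shape[1] < WIN_W:
            break
        scale = gray.shape[0] / level.shape[0]
        # Sliding window with a constant step size over this pyramid level.
        for y in range(0, level.shape[0] - WIN_H, STEP):
            for x in range(0, level.shape[1] - WIN_W, STEP):
                patch = level[y:y + WIN_H, x:x + WIN_W]
                # Face / not face decision by the SVM on the HOG descriptor.
                if svm.decision_function([hog_features(patch)])[0] > 0.5:
                    boxes.append((int(x * scale), int(y * scale),
                                  int(WIN_W * scale), int(WIN_H * scale)))
    # A non-maximum suppression (NMS) pass would normally merge the
    # overlapping boxes collected here; omitted for brevity.
    return boxes

print(detect(io.imread("group_photo.jpg")))
```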

Facial Landmarks

This is the next step in our analysis phase and works as follows:

For each detected face, output the local region coordinates of each facial feature of that face. This includes the eyes, bone, lips, nose, mouth, chin, etc., usually in the form of points (X, Y).

Extracting facial landmarks is a relatively cheap operation for the CPU given a bounding box (i.e. a cropped image containing the target face), but it is quite difficult for the programmer to implement unless some not-so-fast machine learning approach is used.

You can find out more about extracting facial landmarks here, or in this PDF: One millisecond face alignment with an ensemble of regression trees.

facial landmarks

In some (and obviously useful) cases, face detection and landmarks extraction are combined into a single operation. Dlib appears to achieve that, but in fact it requires face detection to take place first and shape extraction next, so it is a two-step operation.

PixLab does achieve that in a single call via the facelandmarks API endpoint, which we will be using later; a minimal request looks like the sketch below.
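The img and key parameters and the faces, rectangle and landmarks response fields follow the PixLab documentation, so treat them as assumptions to verify against the docs; the input image URL is a placeholder.

```python
import requests

# One HTTP request: face detection + landmarks extraction in a single pass.
reply = requests.get('https://api.pixlab.io/facelandmarks', params={
    'img': 'https://pixlab.xyz/images/group.jpg',  # placeholder input image
    'key': 'My_PixLab_Key'                         # your PixLab API key
}).json()

if reply['status'] != 200:
    print(reply['error'])
else:
    for face in reply['faces']:
        print(face['rectangle'])  # bounding box of each detected face
        print(face['landmarks'])  # nose, chin, lips, bone, eyes, etc.
```

With that in hand, we can move on to the next part of our quest for the best snap: the processing phase.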

Image & Video Processing

Once we have the facial landmarks, 70% of the work is done. All we have to do now is superpose the target filter, such as the flower crown, dog nose, etc., on top of the desired region of the face, like the bone region in the flower crown case. This operation is called compositing, that is, combining multiple visual elements from separate sources into a single image.

For the sake of simplicity, we'll stick with image processing only. Video processing is similar in concept but requires some additional steps, such as extracting each frame using a decoder (i.e. FFmpeg or CvCapture from OpenCV) and treating that frame exactly like you would treat a still image.
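For instance, the frame-extraction loop with OpenCV could look like this minimal sketch (the input file name and the per-frame handler are placeholders):

```python
import cv2

cap = cv2.VideoCapture("input_video.mp4")  # placeholder video file
while True:
    ok, frame = cap.read()                 # decode the next frame
    if not ok:
        break                              # end of stream
    # Treat 'frame' exactly like a still image: run the analysis phase
    # (face detection + landmarks), then composite the filter onto it.
    # process_frame(frame)                 # hypothetical per-frame handler
cap.release()
```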

Now that we understand how those filters are created, it's time to produce a few of them with some code in the next section!

Restful APIs come to the rescue

Fortunately for the app builder, producing those filters is quite easy if a cloud vision service is used instead of building & compiling your own. All you need to do is make a simple HTTP request to the target service for the analysis phase to take place on your behalf. The notable cloud providers here are Microsoft, Google and PixLab.

The former two (Microsoft & Google) offer machine vision only. In other words, you'll be able to detect faces and extract their shapes using state-of-the-art machine learning algorithms, but it is up to you to perform the processing phase. That is, composite the flower crown or whatever filter you want using your own image processing library.

PixLab, on the other side, offers both computer vision & media processing as a single set of unified RESTful APIs and ships with over 130 API endpoints.

Enough talking now, let's start programming our filters...

Programming Our First Filter

Given an input image with some pretty faces:

input picture

and this flower crown:

flower crown

located at pixlab.xyz/images/flower_crown.png

Output something like this:

out snap 1

Using this code:
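A condensed sketch of the gist's logic follows. The endpoint names (facelandmarks, smartresize, merge) come straight from this article, but the request & response field names (faces, rectangle, landmarks['bone']['center'], link, src, cord) are assumptions taken from the PixLab documentation & samples, and the input image URL is a placeholder:

```python
import requests
import json

KEY = 'My_PixLab_Key'                                  # your PixLab API key
IMG = 'https://pixlab.xyz/images/group.jpg'            # placeholder input image
CROWN = 'https://pixlab.xyz/images/flower_crown.png'   # the flower crown

# 1. Analysis: face detection + landmarks extraction in a single call.
print("Detecting and extracting facial landmarks..")
reply = requests.get('https://api.pixlab.io/facelandmarks',
                     params={'img': IMG, 'key': KEY}).json()
if reply['status'] != 200:
    exit(reply['error'])
print(str(len(reply['faces'])) + " faces were detected")

cords = []  # one composite entry per detected face for the final merge call
for face in reply['faces']:
    rect = face['rectangle']
    # 2. Processing: fit the crown to the face/bone width. The height is
    # left out so the engine computes it automatically.
    crown = requests.get('https://api.pixlab.io/smartresize', params={
        'img': CROWN, 'key': KEY, 'width': rect['width']
    }).json()
    # Composite the resized crown on top of the bone (forehead) center.
    bone = face['landmarks']['bone']['center']
    cords.append({
        'img': crown['link'],
        'x': bone['x'], 'y': bone['y'],
        'center': True, 'center_y': True  # let the engine pick the best fit
    })

# 3. Final composite operation.
print("Composite operation...")
out = requests.post('https://api.pixlab.io/merge',
                    headers={'Content-Type': 'application/json'},
                    data=json.dumps({'src': IMG, 'key': KEY, 'cord': cords})).json()
print("Snap Filter Effect: " + out['link'])
```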

The program output when run should look like:

Detecting and extracting facial landmarks..

4 faces were detected

Coordinates...
        width: 223 height: 223 x: 376 y: 112

Landmarks...
        Nose: X: 479.3, Y: 244.2
        Bottom Lip: X: 481.7, Y: 275.3
        Top Lip: X: 481.6, Y: 267.1
        Chin: X: 486.7, Y: 327.3
        Bone Center: X: 490, Y: 125
        Bone Outer Left: X: 356, Y: 72
        Bone Outer Right: X: 564, Y: 72
        Bone Center: X: 490, Y: 125
        Eye Pupil Left: X: 437.1, Y: 177.3
        Eye Pupil Right: X: 543.7, Y: 173.3
        Eye Left Brown Inner: X: 458.2, Y: 161.1
        Eye Right Brown Inner: X: 504.9, Y: 156.1
        Eye Left Outer: X: 417.5, Y: 180.9
        Eye Right Outer: X: 559.4, Y: 175.6

Resizing the snap flower crown...

Composite operation...

Snap Filter Effect: https://pixlab.xyz/24p596187b0b1b68.jpg

If this is the first time you have seen the PixLab API in action, you are invited to read the excellent introduction to the API in 5 minutes or less.

To mimic such behavior, only three commands (API endpoints) are actually needed to produce such a filter:

  1. facelandmarks is the analysis command we call first, on line 29 of the Python gist. As said earlier, it is an all-in-one operation: it tries to detect all present human faces, extracts their rectangle coordinates and, more importantly, outputs the landmarks for each facial feature such as the eyes, bone, nose, mouth, chin, lips, etc. of the target face. We'll use these coordinates later to composite stuff on top of the desired face region. In our case, only the bone region coordinates are actually needed to composite the flower crown. Refer to the PixLab documentation for additional information on the facelandmarks command.
  2. Now, for each detected face, we perform the following processing operations (the analysis phase is done at this stage):
  3. smartresize is called next, on line 84, in order to fit the image to be composited (i.e. the flower crown) to the desired dimension, such as the bone width of the target face. This is an essential step since the flower crown is quite big at this stage compared to the bone width. The bone width is assumed to be the same as the face width, so we use that value. The engine is smart enough to calculate the height for us, so we leave the height field untouched.
  4. merge aka composite is the main processing command we call next. It expects a list of coordinates (X, Y) and a list of images to be composited (i.e. the flower crown) on top of the target region (the bone center in our case). The coordinates were collected on line 97 and passed verbatim to the merge command on line 108. Refer to the PixLab documentation for additional information on the merge command.
  5. Optionally, you may use some photo filter commands such as grayscale, blur, oilpaint, etc. for some cool background effects if desired.

We move now to a more complete example with the famous dog filter.

A More Complex Example

Given an input image:

input faces

and these dog facial parts:

dog parts

available separately:

Output something like this:

output snap

Using this code:
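The full gist follows the same structure as the flower crown one; the sketch below only shows how the three dog parts could be mapped onto the landmarks returned by facelandmarks. The landmark key names and the Y offsets are assumptions for illustration:

```python
def dog_filter_cords(face, nose_img, left_ear_img, right_ear_img):
    """Build the merge 'cord' entries for one face. The image URLs are the
    outputs of smartresize; landmark key names are assumptions."""
    lm = face['landmarks']
    return [
        # Dog nose: centered on the face nose coordinates.
        {'img': nose_img, 'x': lm['nose']['x'], 'y': lm['nose']['y'],
         'center': True, 'center_y': True},
        # Left ear: bone outer-left region, Y nudged up for a better fit.
        {'img': left_ear_img, 'x': lm['bone']['outer_left']['x'],
         'y': lm['bone']['outer_left']['y'] - 30,
         'center': True, 'center_y': True},
        # Right ear: same rule on the bone outer-right region.
        {'img': right_ear_img, 'x': lm['bone']['outer_right']['x'],
         'y': lm['bone']['outer_right']['y'] - 30,
         'center': True, 'center_y': True},
    ]
```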

As you may notice, no matter how complex your filter is, the logic is always the same: facelandmarks first, smartresize next, and finally merge to generate the filter.

Tip: If you want to download the output image immediately without remote storage, set the Blob parameter to true in your final HTTP request (line 148).
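Something along these lines, reusing the payload names from the flower crown sketch above; whether the parameter is spelled blob or Blob should be checked against the documentation:

```python
# Ask merge for the raw image bytes instead of a hosted link.
payload = {'src': IMG, 'key': KEY, 'cord': cords, 'blob': True}
resp = requests.post('https://api.pixlab.io/merge',
                     headers={'Content-Type': 'application/json'},
                     data=json.dumps(payload))
with open('snap_filter.jpg', 'wb') as f:
    f.write(resp.content)  # the response body is the image itself
```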

Compositing the flower crown is quite easy to implement. All we have to do is locate the bone center region of the target face and invoke merge with the X & Y coordinates of that region. Don't forget to set the center & center_y parameters to true for each entry the merge command takes. This will calculate the best position for the filter. The same rule applies to the dog nose, except we use the face nose coordinates.

The difficulty here is to find the best location for the left & right ears. The bone leftmost region is selected for the left ear, and we adjusted the Y position for optimal effect. The same rule applies for the right ear. The rotate command can be of particular help here if the target face is inclined to some degree.

One Last Example

Given this woman:

women sample

and this eye mask:

eye_mask

located at pixlab.xyz/images/eye_mask.png

plus this mustache:

mustache

located at pixlab.xyz/images/mustache.png

Output something like this with some text (i.e. MEME) on the bottom of the image:

women mask

Using this code:
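Again a condensed sketch rather than the embedded gist: the eye mask is fitted to the face width and centered between the pupils, the mustache sits on the top lip, and drawtext adds the caption. The landmark key names (eye_pupil_left, top_lip) and the drawtext parameters shown here are guesses to verify against the PixLab documentation; the input image URL and the caption are placeholders:

```python
import requests
import json

KEY = 'My_PixLab_Key'
IMG = 'https://pixlab.xyz/images/woman.jpg'        # placeholder input image
MASK = 'https://pixlab.xyz/images/eye_mask.png'
MUSTACHE = 'https://pixlab.xyz/images/mustache.png'

face = requests.get('https://api.pixlab.io/facelandmarks',
                    params={'img': IMG, 'key': KEY}).json()['faces'][0]
lm, rect = face['landmarks'], face['rectangle']

def fit(url, width):
    # Resize a filter part to the desired width; the engine computes the height.
    return requests.get('https://api.pixlab.io/smartresize', params={
        'img': url, 'key': KEY, 'width': width
    }).json()['link']

cords = [
    # Eye mask: fitted to the face width, centered between the two pupils.
    {'img': fit(MASK, rect['width']),
     'x': (lm['eye_pupil_left']['x'] + lm['eye_pupil_right']['x']) / 2,
     'y': (lm['eye_pupil_left']['y'] + lm['eye_pupil_right']['y']) / 2,
     'center': True, 'center_y': True},
    # Mustache: roughly half the face width, sitting on the top lip.
    {'img': fit(MUSTACHE, rect['width'] // 2),
     'x': lm['top_lip']['x'], 'y': lm['top_lip']['y'],
     'center': True, 'center_y': True},
]
merged = requests.post('https://api.pixlab.io/merge',
                       headers={'Content-Type': 'application/json'},
                       data=json.dumps({'src': IMG, 'key': KEY,
                                        'cord': cords})).json()

# MEME caption at the bottom of the image via drawtext.
meme = requests.get('https://api.pixlab.io/drawtext', params={
    'img': merged['link'], 'key': KEY,
    'bottom': 'MUCH FILTER, VERY SNAP'   # hypothetical caption text
}).json()
print(meme['link'])
```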

Note how smartresize is of particular help here. If we did not rely on it, the mustache and the eye mask would occupy the entire face instead of a small region. When called, the PixLab engine calculates the best output dimension for us: all we have to do is pass the desired width, and the height will be automatically adjusted (or vice versa).

We also call the drawtext command for the first time to draw some text on the bottom of the image. This command is quite flexible: you can supply the font name, size & colors, stroke width & height, etc. Take a look at the meme documentation for additional information.

Making Your Own Stuff

You may be tempted to execute the code listed above and see it in action. All you have to do is put your own API key in place of the dummy one. If you don't have a key, you may generate one from your PixLab dashboard. If you don't have a PixLab account, sign up here.

You are invited to take a look at the Github sample page for dozens of other interesting samples in action, such as censoring images based on their NSFW score, blurring & extracting human faces, making GIFs, etc. All of them are documented on the PixLab sample page.

Conclusion

As we saw, making Snapchat-like filters is a two-step operation. The analysis phase always comes first and is also the most complicated: for each detected face in the image or video frame, extract and record the facial landmarks of that face. Most filters require a tiny set of landmarks, if not a single one, like the bone region for the flower crown filter. We saw that few open source software libraries are capable of doing face detection and landmarks extraction at the same time, apart from Dlib and the upcoming Symisc SOD, but alternatives are always possible.

Once you have the facial landmarks, most of the complicated work is done. We can start the processing phase, which is a relatively cheap composite operation: all we have to do is superpose the target filter on top of the desired region of the face. We saw that GPUImage and MagickWand are potential candidates for our image processing tasks.

Our proposed solution is to rely on the PixLab API for this kind of task, since it is capable of image analysis & processing at the same time and it spares us all the hassle of integrating the necessary libraries to perform such work. A simple HTTP request from any programming language to the target endpoint is enough:

  • facelandmarks for facial landmarks extraction.
  • smartresize to fit the target filter to the desired dimension such as the face width for the flower crown case.
  • merge to superpose the target filter on top of the desired facial region.
  • Optionally, grayscale, drawtext, oilpaint, etc. for some background effect if desired.
  • Et voilà! You're done.

Finally, if you have any suggestions or criticism, please leave a comment below.


Discussion


This is super cool! Have you tried any of the open source / free APIs to do similar? If so, how do they stack up?

 

If you are programming for iOS, you can use the Vision framework, which has face landmark detection built in. It works on device, so it's free without limits and you can do all that stuff in real time.

I have not yet tried the face landmark detection but I have successfully applied the rectangle detection from the Vision framework to extract the image from a TV to perform a classification on what channel is currently running, which worked quite well.

 

Yes, check out the SOD embedded project by the same PixLab team, which lets you detect faces on mobile & IoT devices in real time.

sod.pixlab.io/
Github repo: github.com/symisc/sod