Ramiro - Ramgen

Let's create a face dataset with unsplash dataset

I want to get more into dataset creation and exploration, and when Unsplash released a dataset a while back I knew it was a good excuse to start.

Then, looking at FFHQ and similar face datasets, I thought it would be interesting to build a pipeline that extracts a sub-dataset of faces from the Unsplash dataset.

We are going to use the test dataset that Unsplash provides, but everything works the same with the full dataset if you have access to it. You can find more information in the GitHub repo.

Woman with face landmarks plotted and a green face box

This article is the summarized version of this video, where I go more in depth on each part and walk through why and how we do things. Hope you can check it out!

Awesome! We are going to use Jupyter notebooks and Python scripts, so let's go!

First we need the dataset

If you have the link to the full dataset, change the download link to it and also change the output file name.

mkdir ds
curl -L "" -s --output "./ds/"
tar -xf "./ds/" --directory ./ds/

Here we make the ds folder, then download and uncompress the dataset. You should have 5 tab-separated values files and 3 Markdown files; we are going to focus on the photos file.
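A quick way to peek at the photos file before processing it is to read it with Python's csv module. This is only a minimal sketch: the helper name read_photo_urls is made up for illustration, and the file name in the usage note is an assumption, so check the actual name inside your ds folder. The photo_image_url column is the one the download script at the end of this article reads.

```python
import csv

def read_photo_urls(tsv_path, limit=5):
    """Return up to `limit` values from the photo_image_url column of a TSV file."""
    urls = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # The dataset files are tab separated, so set the delimiter explicitly.
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            urls.append(row["photo_image_url"])
            if len(urls) >= limit:
                break
    return urls
```

You would call it with something like read_photo_urls("./ds/photos.tsv000"), where the file name is an assumption about how the photos file is named after extraction.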

Loading the images

Now let's see how to load an image from a URL in the dataset and have it ready for processing.

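import requests
from io import BytesIO
from PIL import Image
image_bytes = requests.get('')
image_bytes = image_bytes.content
image_stream = BytesIO(image_bytes)
img_open = Image.open(image_stream)  # PIL decodes the byte stream into an image
img_open  # leaving the variable on the last line makes the notebook display it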

We use requests to download the image and wrap the bytes in BytesIO to get a stream. The open method from the PIL library then loads that stream into the img_open variable. If we leave the variable on the last line of the cell, the notebook displays it.
Woman holding her hair

Computing the face box

Awesome, now let's see how to get the face box and display it with matplotlib. First we need the Haar cascade model to get the face box coordinates.

This creates the models folder and downloads the classifier file.

mkdir models
curl -L "" -s --output ./models/haarcascade_frontalface_alt_tree.xml

Great, now let's load the model with CascadeClassifier. The classifier needs a grayscale version of the image, so we first get the NumPy array of the PIL image and then convert it to gray with cvtColor.

Then we feed this gray image to the classifier, which returns a list of coordinates for the faces it detects.
Then we loop through the box coordinates and draw rectangles on a copy of our image.
Each box has this format: [x,y,w,h]

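import cv2
import numpy as np
import matplotlib.pyplot as plt

image_draw = np.array(img_open).copy()
detector = cv2.CascadeClassifier('./models/haarcascade_frontalface_alt_tree.xml')
image_gray = cv2.cvtColor(np.array(img_open), cv2.COLOR_RGB2GRAY)

faces = detector.detectMultiScale(image_gray)

for cords in faces:
    x, y, w, h = (int(v) for v in cords)
    # Green box from the top-left to the bottom-right corner (thickness 2 is an assumption)
    image_draw = cv2.rectangle(image_draw, (x, y), (x + w, y + h), (0, 255, 0), 2)

plt.figure(figsize = (10,10))
plt.imshow(image_draw)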
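Since the boxes come as [x,y,w,h] but drawing a rectangle needs two corner points, a tiny helper can make the conversion explicit. This is just a sketch; the name box_to_corners is made up for illustration and is not part of OpenCV.

```python
def box_to_corners(box):
    """Convert an [x, y, w, h] box into (top_left, bottom_right) points."""
    x, y, w, h = (int(v) for v in box)
    # The bottom-right corner is the top-left corner shifted by the width and height.
    return (x, y), (x + w, y + h)

top_left, bottom_right = box_to_corners([411, 268, 149, 149])
print(top_left, bottom_right)  # (411, 268) (560, 417)
```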

We can also get the face landmarks using the lbfmodel, which we download into the models folder.

curl -L "" -s  --output ./models/lbfmodel.yaml

This model requires the box coordinates of each face, so we loop through them and compute the face landmarks for each one.

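# cv2.face ships with the opencv-contrib-python package
landmark_detector = cv2.face.createFacemarkLBF()
landmark_detector.loadModel('./models/lbfmodel.yaml')

for cords in faces:
  # fit expects an array of boxes, so we wrap the single box
  _, landmarks = landmark_detector.fit(image_gray, np.array([cords]))

  for landmark in landmarks:
    for i,(x,y) in enumerate(landmark[0]):
      image_draw = cv2.circle(image_draw, (int(x),int(y)),2,(255,255,255),1)
      image_draw = cv2.putText(image_draw, f"{i}", (int(x-5),int(y-5)), cv2.FONT_HERSHEY_SIMPLEX,
                  0.4, (50, 255, 50) , 1, cv2.LINE_AA)

plt.figure(figsize = (40,40))
plt.imshow(image_draw)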

Woman with face landmarks plotted and a green face box

Crop the face and recalculate the facebox to the full size

As you can see, we computed the face box coordinates on a rescaled version of the original image, so let's see how to get the box for the original image.

First let's look at a function that, given the face box coordinates, the shape of the image the boxes were computed on and the shape of the image we want to map them to, computes the new boxes.

Basically, we calculate a ratio for each axis and multiply the box by it: we loop through all the boxes and scale x and y as well as the height and width.

def recal_box(box_cords, old_shape, new_shape):
    # Shapes are (height, width); the ratios are the same for every box
    Ry, Rx = new_shape[0]/old_shape[0], new_shape[1]/old_shape[1]
    recal_boxes = []

    for box in box_cords:
        x,y,w,h = box
        new_y, new_x = int(Ry*y), int(Rx*x)

        new_h, new_w = int(Ry*h), int(Rx*w)

        recal_boxes.append((new_x, new_y, new_w, new_h))

    return recal_boxes
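To make the ratio arithmetic concrete, here is a small worked example with made-up shapes. The single-box helper scale_box is a hypothetical stand-in that follows the same idea as recal_box: shapes are (height, width), and each coordinate is multiplied by the ratio of its axis.

```python
def scale_box(box, old_shape, new_shape):
    """Scale one [x, y, w, h] box from old_shape to new_shape, both (height, width)."""
    ry = new_shape[0] / old_shape[0]  # vertical ratio, applied to y and h
    rx = new_shape[1] / old_shape[1]  # horizontal ratio, applied to x and w
    x, y, w, h = box
    return int(rx * x), int(ry * y), int(rx * w), int(ry * h)

# Made-up shapes: detection ran on a 100x200 image, the original is 400x800,
# so both ratios are 4 and every value in the box is multiplied by 4.
print(scale_box([10, 20, 30, 40], (100, 200), (400, 800)))  # (40, 80, 120, 160)
```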

Now let's go over how to use the function. We load both images to get their shapes, or rather their heights and widths: we know the width of the rescaled image but not its height, so we read that from the image as well.
Then we pass both shapes to recal_box, which calculates the ratios it needs.

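# test_recal is the URL of the rescaled image, including its query string
image_bytes_s = requests.get(test_recal)

image_bytes_s = image_bytes_s.content
image_stream_s = BytesIO(image_bytes_s)
img_open_s = np.array(Image.open(image_stream_s))

# Dropping the query string gives the URL of the full-size image
image_bytes = requests.get(test_recal.split('?')[0])
image_bytes = image_bytes.content
image_stream = BytesIO(image_bytes)
img_open = np.array(Image.open(image_stream))

# Assumed definitions: og_shape is the shape the box was detected on (the
# rescaled image) and new_shape is the shape of the full-size image
og_shape = img_open_s.shape[:2]
new_shape = img_open.shape[:2]

new_box=recal_box([[411, 268, 149, 149]],og_shape, new_shape)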
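The snippet above gets the full-size URL with split('?')[0]. If you prefer something a bit more explicit, the same thing can be sketched with urllib.parse. The helper name full_size_url and the example URL with its w=640 parameter are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def full_size_url(url):
    """Drop the query string and fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Hypothetical URL; the w=640 resize parameter is only an illustration.
print(full_size_url("https://images.example.com/photo-123?w=640"))
# https://images.example.com/photo-123
```

A URL without a query string comes back unchanged, so it is safe to apply to every row.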

And we have our result!

We can do this for every image we process in the dataset and get the face boxes for the original images!

Downloading images!

Alright, the last thing is downloading all the faces!
Let's say we have processed all the images and have a CSV with the link to the original image and the face boxes for that image.
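To picture what such a file could look like, here is a small sketch that parses one made-up row. The column names photo_image_url and face_box_cords and the ';' separator match what the download script in this section reads; the sample URL and boxes are invented, and parse_rows is a hypothetical helper. The boxes column is stored as text, and ast.literal_eval turns it back into a list while only accepting Python literals.

```python
import ast
import csv
import io

# Made-up sample: a header line and one row, separated by ';'.
sample = (
    "photo_image_url;face_box_cords\n"
    "https://images.example.com/photo-123;[(411, 268, 149, 149), (10, 20, 30, 40)]\n"
)

def parse_rows(text):
    """Yield (url, boxes) pairs, turning the boxes column back into a list."""
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        yield row["photo_image_url"], ast.literal_eval(row["face_box_cords"])

for url, boxes in parse_rows(sample):
    print(url, boxes)
```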

import argparse
import ast
import os
from io import BytesIO
from pathlib import Path

import cv2
import numpy as np
import pandas as pd
import requests
from PIL import Image

def cropface(image, box, fill=.5, ratios=(1,1)):
    h_img,w_img = image.shape[:2]

    Ry, Rx = ratios
    x,y,w,h = box

    # Top-left corner, moved out by the fill and clamped to the image
    new_y,new_x = Ry*y, Rx*x
    y_fill = max(0, new_y-h*fill)
    x_fill = max(0, new_x-w*fill)

    # Bottom-right corner, moved out by the fill and clamped to the image
    new_h, new_w = Ry*(h+y), Rx*(w+x)

    h_fill = min(h_img, new_h+h*fill)
    w_fill = min(w_img, new_w+w*fill)

    return image[int(y_fill):int(h_fill),
                 int(x_fill):int(w_fill)]

def get_opt():
    parser = argparse.ArgumentParser()
    parser.add_argument('--source', type=str, required=True)
    parser.add_argument('--output', type=str, default='./unsplash_faces')

    opt = parser.parse_args()
    return opt

if __name__ == '__main__':
  opt = get_opt()
  Path(opt.output).mkdir(exist_ok=True, parents=True)

  print('Loading photos df')
  photos_df=pd.read_csv(opt.source, sep=';', header=0)
  print('Finish photos df')

  for j, r in photos_df.iterrows():

    # literal_eval parses the stored list of boxes without running arbitrary code
    cur_cords = ast.literal_eval(r['face_box_cords'])
    cur_img = r['photo_image_url']

    image_bytes = requests.get(cur_img.split('?')[0])

    image_bytes = image_bytes.content
    image_stream = BytesIO(image_bytes)
    img_open = np.array(Image.open(image_stream))

    name = cur_img.split('/')[-1]
    for i, cords in enumerate(cur_cords):
      cur_crop = cropface(img_open, cords, fill=0, ratios=(1,1))
      try:
        # OpenCV expects BGR, so reverse the RGB channel order before saving
        cv2.imwrite(os.path.join(opt.output,f'{name}_{i}.jpg'), cur_crop[:,:,::-1])
      except Exception:
        print(f'Error with {cur_img}')


Here we first declare the cropface function, which takes an image and a box, crops the image and returns the crop.
This function can also rescale the box with the given ratios and add a fill (padding) around it if we want.
We slice the image using the face coordinates, and we use max and min so the fill doesn't go past the edges; we don't want an index out of range error :D
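To see the clamping in isolation, here is a small sketch that only computes the slice bounds, without touching any image. The helper padded_bounds is hypothetical and leaves out the ratios for brevity; fill is a fraction of the box size added on each side, as in cropface.

```python
def padded_bounds(box, image_shape, fill=0.5):
    """Return (y0, y1, x0, x1) for a box grown by `fill` on each side, clamped to the image."""
    img_h, img_w = image_shape[:2]
    x, y, w, h = box
    # max keeps the start from going negative, min keeps the end inside the image.
    y0 = max(0, int(y - h * fill))
    x0 = max(0, int(x - w * fill))
    y1 = min(img_h, int(y + h + h * fill))
    x1 = min(img_w, int(x + w + w * fill))
    return y0, y1, x0, x1

print(padded_bounds((10, 10, 20, 20), (100, 100)))  # (0, 40, 0, 40)
print(padded_bounds((80, 80, 20, 20), (100, 100)))  # (70, 100, 70, 100)
```

The max on the start index matters with NumPy in particular, because a negative start index counts from the end of the array instead of raising an error, which would silently give a wrong crop.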

We use argparse for the script options, then we iterate over the CSV with our data and repeat the same process as before, except that here we only need to crop each face and save it.

Wonderful, that's it! Hope you liked it. Follow me and also check out my YT!

1: Photo by Gift Habeshaw
2: Original photo by freestocks
