What is a Convolutional Neural Network (CNN)? How can CNNs be used to detect features in images? This is the video of a live coding session in which I show how to build a CNN in Python using Keras and extend the "smile detector" I built last week to use it.
A Convolutional Neural Network is a particular type of neural network that is well suited to analysing images. It works by passing a 'kernel' across the input image (convolution) to produce an output. These convolutional layers are stacked to produce a deep learning network that is able to learn quite complex features in images.
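To make the idea of "passing a kernel across the image" concrete, here is a minimal pure-Python sketch. It is not the code from the session: the `convolve2d` helper, the toy 4x4 image and the hand-written kernel are all made up for illustration, and in a real CNN the kernel values are learned during training rather than written by hand. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the output map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            # Multiply the kernel element-wise with the patch under it and sum
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        output.append(out_row)
    return output


# A toy 4x4 "image" with a vertical edge: dark (0) on the left, bright (1) on the right
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# A hand-written 2x2 kernel that responds where brightness increases left to right
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
```

The output is non-zero only in the column where the kernel straddles the edge, which is the sense in which a convolutional layer "detects" a feature.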
In this session I coded a simple 3-layer CNN and trained it with manually classified images of faces.
Much of the code was based on the previous iteration of this project. After the live coding session, I refactored the code to use Python generators to simplify the processing pipeline.
This method opens the video file and iterates through the frames yielding each frame.
    def frame_generator(self, video_fn):
        cap = cv2.VideoCapture(video_fn)
        while True:
            # Read each frame of the video
            ret, frame = cap.read()

            # End of file, so break loop
            if not ret:
                break

            yield frame

        cap.release()
As in the previous session, we iterate through the frames and calculate the difference between each frame and the previous one. The method then returns the threshold needed to keep just the 5% of frames that changed the most:
    def calc_threshold(self, frames, q=0.95):
        prev_frame = next(frames)
        counts = []

        for frame in frames:
            # Calculate the pixel difference between the current
            # frame and the previous one
            diff = cv2.absdiff(frame, prev_frame)
            non_zero_count = np.count_nonzero(diff)

            # Append the count to our list of counts
            counts.append(non_zero_count)
            prev_frame = frame

        return int(np.quantile(counts, q))
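To show what the quantile step is doing, here is a small standard-library sketch. The `quantile` helper is my own stand-in that uses linear interpolation between sorted values (the default behaviour of `np.quantile`), and the counts are made-up numbers rather than real measurements from a video.

```python
def quantile(values, q):
    """Return the q-th quantile of `values` using linear interpolation."""
    ordered = sorted(values)
    # Fractional position of the quantile within the sorted list
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up counts of changed pixels for 11 consecutive frame pairs
counts = list(range(0, 1100, 100))  # 0, 100, ..., 1000

threshold = int(quantile(counts, 0.95))

# Only frames whose change count exceeds the threshold would be analysed further
kept = [c for c in counts if c > threshold]
print(threshold, kept)
```

With these toy numbers the threshold falls between the two largest counts, so only the single most-changed frame would be passed on to the more expensive face analysis.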
This is another generator. It takes an iterable of frames and a threshold, and yields each frame whose difference from the previous frame is above the supplied threshold.
    def filter_frames(self, frames, threshold):
        prev_frame = next(frames)

        for frame in frames:
            # Calculate the pixel difference between the current
            # frame and the previous one
            diff = cv2.absdiff(frame, prev_frame)
            non_zero_count = np.count_nonzero(diff)

            if non_zero_count > threshold:
                yield frame

            prev_frame = frame
By factoring out the methods above, we can chain the generators together and pass them into this method to actually look for the smiliest image. This means that (unlike the previous version) this method doesn't need to concern itself with deciding which frames to analyse.
We use the trained neural network (as a TensorFlow Lite model) to predict whether a face is smiling. Much of the structure is similar to last session: we first scan the image to find faces, then align each of those faces using a facial aligner, which transforms the face so that the eyes are in the same location in each image. We pass each face into the neural network, which gives us a score from 0 to 1.0 of how likely it is to be smiling. We sum all those values to get an overall score of 'smiliness' for the frame.
    def find_smiliest_frame(self, frames, callback=None):
        # Allocate the tensors for TensorFlow Lite
        self.interpreter.allocate_tensors()
        input_details = self.interpreter.get_input_details()
        output_details = self.interpreter.get_output_details()

        def detect(gray, frame):
            # Detect faces within the greyscale version of the frame
            faces = self.detector(gray, 2)
            smile_score = 0

            # For each face we find...
            for rect in faces:
                (x, y, w, h) = rect_to_bb(rect)
                face_orig = imutils.resize(frame[y:y + h, x:x + w], width=256)

                # Align the face
                face_aligned = self.face_aligner.align(frame, gray, rect)

                # Reshape the face to the input shape our neural network
                # expects (a batch of one 256x256 colour image)
                face_aligned = face_aligned.reshape(1, 256, 256, 3)

                # Scale pixel values to 0..1
                face_aligned = face_aligned.astype(np.float32) / 255.0

                # Pass the face into the input tensor for the network
                self.interpreter.set_tensor(input_details[0]['index'],
                                            face_aligned)

                # Actually run the neural network
                self.interpreter.invoke()

                # Extract the prediction from the output tensor
                pred = self.interpreter.get_tensor(
                    output_details[0]['index'])

                # Keep a sum of all 'smiliness' scores
                smile_score += pred

            return smile_score, frame

        best_smile_score = 0
        best_frame = next(frames)

        for frame in frames:
            # Convert the frame to grayscale
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

            # Call the detector function
            smile_score, frame = detect(gray, frame)

            # Check if we have more smiles in this frame
            # than our "best" frame
            if smile_score > best_smile_score:
                best_smile_score = smile_score
                best_frame = frame
                if callback is not None:
                    callback(best_frame, best_smile_score)

        return best_smile_score, best_frame
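The scoring rule itself (sum the per-face scores, keep the frame with the highest total) can be seen in isolation with a few made-up numbers. The frame names and scores below are purely hypothetical, and `frame_score` is just an illustrative helper, not part of the package.

```python
# Hypothetical per-face smile scores for three frames (made-up values)
frames = {
    "frame_a": [0.9],            # one face, clearly smiling
    "frame_b": [0.6, 0.7],       # two faces, both fairly smiley
    "frame_c": [0.2, 0.1, 0.3],  # three faces, none smiling much
}


def frame_score(face_scores):
    # The 'smiliness' of a frame is the sum of its per-face scores
    return sum(face_scores)


# Pick the frame with the highest total score
best_name = max(frames, key=lambda name: frame_score(frames[name]))
print(best_name)
```

Because the scores are summed rather than averaged, a frame with several moderately smiley faces can beat a frame with a single very smiley face, which seems like a reasonable choice for a group video.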
We can then chain the functions together:
smiler = Smiler(landmarks_path, model_path)

fg = smiler.frame_generator(args.video_fn)
threshold = smiler.calc_threshold(fg, args.quantile)

fg = smiler.frame_generator(args.video_fn)
ffg = smiler.filter_frames(fg, threshold)

smile_score, image = smiler.find_smiliest_frame(ffg)
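Note that the frame generator is created twice in the chain above. A generator can only be consumed once, so the pass that calculates the threshold uses it up and a fresh one is needed for the filtering pass. The toy sketch below illustrates this with tuples standing in for video frames; `frame_source`, `changed_pixels` and `filter_changed` are simplified stand-ins I made up, not the methods from the package.

```python
def frame_source():
    """Stand-in for a video reader: yields toy 'frames' (tuples of pixel values)."""
    yield from [
        (0, 0, 0, 0),
        (0, 0, 0, 1),  # 1 pixel changed from the previous frame
        (1, 1, 1, 0),  # 4 pixels changed
        (1, 1, 1, 0),  # 0 pixels changed
        (1, 0, 0, 0),  # 2 pixels changed
    ]


def changed_pixels(a, b):
    # Count the positions at which the two frames differ
    return sum(1 for x, y in zip(a, b) if x != y)


def filter_changed(frames, threshold):
    """Yield frames that differ from their predecessor by more than `threshold` pixels."""
    prev = next(frames)
    for frame in frames:
        if changed_pixels(frame, prev) > threshold:
            yield frame
        prev = frame


frames = frame_source()
first_pass = list(frames)   # consumes every frame
second_pass = list(frames)  # the generator is now exhausted

# A fresh generator is needed for the filtering pass
selected = list(filter_changed(frame_source(), threshold=1))
print(len(first_pass), len(second_pass), selected)
```

The trade-off is that the video is decoded twice, but in exchange no more than a couple of frames need to be held in memory at any one time.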
Testing it out, it all works pretty well and finds a nice snapshot of smiling faces from the video.
The full code to this is now wrapped up as a complete Python package:
This is a library and CLI tool to extract the "smiliest" frame from a video of people.
% pip install choirless_smiler
% smiler video.mp4 snapshot.jpg
It will do a pre-scan to determine the 5% of frames that changed the most from their previous frame, and consider only those. If you already know the threshold of change you want to use, you can supply it directly, e.g.
% smiler video.mp4 snapshot.jpg --threshold 480000
The first time smiler runs it will download facial landmark data and store it in a cache directory. The location of this data and the cache directory can be specified as arguments.
% smiler -h
usage: smiler [-h] [--verbose] [--threshold THRESHOLD]
              [--landmarks-url LANDMARKS_URL] [--cache-dir CACHE_DIR]
              [--quantile QUANTILE]
              video_fn image_fn

Save thumbnail of smiliest frame in video

positional arguments
  video_fn  filename for video to
I hope you enjoyed the video. If you want to catch these sessions live, I stream each week at 2pm UK time on the IBM Developer Twitch channel: