Cute Animal Detection (2 Part Series)
When a data scientist colleague of mine recently found out I have a background in mobile app development, he asked me to show him how to use a machine learning model in a mobile app. I figured an image classification app, like the classic Not Hotdog, would be a nice example, since it also requires hooking into the phone’s camera and is not just a model running on a phone instead of a desktop.
It took me a while to piece together everything that I needed to make a finished app. That’s why I thought writing a post would be useful, so that the full journey is documented in one place. All the code that I wrote (for fetching the training images, training the model, and the app itself) is available on GitHub. The code is comprehensively commented, since I wanted it to be useful for both data scientists and mobile app developers.
At first I thought I’d just build a hot dog detector, and found a post that went through the model building part. But I don’t have many hot dogs around, so demoing such an app would be difficult. Since I like cute animals, I decided instead to make a cuteness detector, which, as a more abstract concept, can be demoed in many everyday environments.
For the source of training data, I picked ImageNet. I decided to take puppy and kitten pictures as my “cute” training data, and somewhat arbitrarily, creepy-crawlies and ungulates as my “not cute” training data. After filtering out unavailable images, I was left with 1718 cute images and 1962 non-cute images. Verifying visually that I had good data definitely validated the choice of creepy-crawlies as “not cute”. Brrrrr…
For custom image classification, transfer learning is the way to go. Training an image classifier from scratch would take ages, and any network architecture I could come up with couldn’t be better than the current state of the art. I picked InceptionV3, since I found a good article with example code of how to make a custom image classifier with it.
The key points in building the model are to take the InceptionV3 model as the base, add a new classification layer, and limit the training to the new layer. Since the training data comes from ImageNet, the InceptionV3 constructor is told to use the weights pretrained on ImageNet (weights='imagenet'). The other parameter (include_top=False) says to exclude the top layer, which does the actual classification, since I will be using my own layer for that.
Nowadays, image classification has moved away from pooling after each convolutional layer toward global average pooling (GAP) at the end. So the base InceptionV3 model is extended with a GAP layer, and finally a softmax classification layer with two nodes, since there are two classes to predict, cute and not cute. In the final step, all the layers from the base InceptionV3 model are marked as not trainable so that the training applies only to the added custom layers.
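The model-building steps described above can be sketched with the Keras API bundled in TensorFlow. This is a minimal sketch and not necessarily the code in the repository: the function name build_model, the rmsprop optimizer, and the loss function are assumptions chosen for illustration.

```python
NUM_CLASSES = 2  # "cute" and "not cute"


def build_model(num_classes=NUM_CLASSES):
    """Return InceptionV3 with a new, trainable classification head."""
    # Imported inside the function so this file can be loaded without TensorFlow.
    from tensorflow.keras.applications.inception_v3 import InceptionV3
    from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
    from tensorflow.keras.models import Model

    # Pre-trained ImageNet weights, without the original top classification layer.
    base = InceptionV3(weights="imagenet", include_top=False)

    # Global average pooling, then a softmax layer with one node per class.
    pooled = GlobalAveragePooling2D()(base.output)
    predictions = Dense(num_classes, activation="softmax")(pooled)
    model = Model(inputs=base.input, outputs=predictions)

    # Freeze the InceptionV3 layers so only the new head is trained.
    for layer in base.layers:
        layer.trainable = False

    # Optimizer and loss are assumptions; the post does not name them.
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling build_model() downloads the ImageNet weights on first use, so it needs network access once.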
I had downloaded my training images, and now there is a little bit of work to prepare them for training. A good way to set up data for Keras is to make directories for each of training, validation, and testing data sets. Then, in each of these directories, create a subdirectory for each label and put the images for each label in the corresponding subdirectory. So I have directories “cute” and “notcute”. This lets Keras automatically pick up the labels during training.
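As a rough illustration of that layout, the following sketch creates the directory tree using only the standard library. The split names "train", "validation", and "test" and the helper name make_dataset_dirs are assumptions; only the label directories "cute" and "notcute" come from the text above.

```python
from pathlib import Path

SPLITS = ("train", "validation", "test")  # assumed names for the data sets
LABELS = ("cute", "notcute")              # one subdirectory per label


def make_dataset_dirs(root):
    """Create root/<split>/<label>/ for every split and label; return the paths."""
    created = []
    for split in SPLITS:
        for label in LABELS:
            path = Path(root) / split / label
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```

The images for each label are then copied into the matching directory, for example images/train/cute/.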
Augmenting the data is always a good idea when doing image classification. Keras comes with the ready-made ImageDataGenerator that can be used to automatically augment image data with different transformations.
The ImageDataGenerator is initialized with parameters on which kinds of transformations it should perform. I didn’t pick any that would introduce distortions, since a distortion might well remove cuteness. InceptionV3 expects the pixels to be in the range [-1, 1], and it provides the function preprocess_input to perform the min-max scaling to this range. This function is passed to the generator so that the generated images will be in the appropriate format.
The generator is then set to produce the images from the image directory. The parameter dataset can be either “train” or “test”, depending on whether to generate training or validation data. The image size 299x299 is what is expected by InceptionV3.
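A generator along these lines might look like the sketch below. The specific augmentation settings, the batch size, and the function name make_generator are assumptions; the preprocessing function, the dataset parameter, and the 299x299 target size follow the description above. It also assumes a Keras version that still ships ImageDataGenerator.

```python
import os

IMAGE_SIZE = (299, 299)  # input size expected by InceptionV3


def make_generator(image_dir, dataset, batch_size=32):
    """Yield batches of augmented, preprocessed images from image_dir/<dataset>/."""
    # Imported inside the function so this file can be loaded without TensorFlow.
    from tensorflow.keras.applications.inception_v3 import preprocess_input
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(
        preprocessing_function=preprocess_input,  # scales pixels to [-1, 1]
        rotation_range=20,       # assumed augmentation settings
        width_shift_range=0.2,
        height_shift_range=0.2,
        horizontal_flip=True,
    )
    return datagen.flow_from_directory(
        os.path.join(image_dir, dataset),  # dataset is "train" or "test"
        target_size=IMAGE_SIZE,
        batch_size=batch_size,
        class_mode="categorical",  # one-hot labels for the two-node softmax
    )
```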
Training the model with the generators is simple enough. Five epochs appear suitable for this task. In fact, I didn’t see improvement after the third epoch. With these parameters, the training on my MacBook took about 2.5 hours, which I felt was quite reasonable.
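The training step can be sketched roughly as follows. The function name train_and_save is hypothetical, and passing generators directly to fit() assumes a reasonably recent Keras; older versions used fit_generator() with the same arguments.

```python
EPOCHS = 5  # no improvement was seen after the third epoch


def train_and_save(model, train_generator, validation_generator,
                   path="iscute.h5"):
    """Train the model from the generators, then save it in HDF5 format."""
    history = model.fit(
        train_generator,
        epochs=EPOCHS,
        validation_data=validation_generator,
    )
    # The .h5 extension tells Keras to write an HDF5 file.
    model.save(path)
    return history
```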
The call model.save('iscute.h5') at the end of training saves the model in the HDF5 format used by Keras. This is not the format used for mobile apps. For Android, I chose to use TensorFlow Lite. (I wanted to make an iOS app using CoreML, but I was using Python 3.7, and the CoreML tools do not yet support that version.) TensorFlow comes with the appropriate converter, so there is not very much code here. The HDF5 file I got is 84 MB in size, while the TensorFlow Lite file is 83 MB, so there was not much size reduction.
The only headache here was discovering the need for the input_shapes parameter. On my initial run without that parameter, I got the error message “ValueError: None is only supported in the 1st dimension. Tensor 'input_1' has invalid shape '[None, None, None, 3]'”. The required dimensions are the image dimensions 299x299, and the error message says what tensor name to use as the key in the dictionary parameter.
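A conversion along these lines is sketched below, assuming the TensorFlow 1.x converter API (in TensorFlow 2.x the same method lives under tf.compat.v1.lite.TFLiteConverter). The output file name iscute.tflite and the function name convert_to_tflite are assumptions.

```python
# Fix the batch size and image dimensions of the input tensor;
# the key is the tensor name reported in the error message.
INPUT_SHAPES = {"input_1": [1, 299, 299, 3]}


def convert_to_tflite(keras_path="iscute.h5", tflite_path="iscute.tflite"):
    """Convert a saved Keras HDF5 model to a TensorFlow Lite file."""
    import tensorflow as tf  # TensorFlow 1.x API assumed

    converter = tf.lite.TFLiteConverter.from_keras_model_file(
        keras_path,
        input_shapes=INPUT_SHAPES,
    )
    tflite_model = converter.convert()  # serialized model as bytes
    with open(tflite_path, "wb") as f:
        f.write(tflite_model)
    return tflite_path
```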
Now I had my model ready to use in the app, so it was time to build the app. It needs to do the following:
1. Show the camera preview on screen
2. Let the user press a button to capture the current image
3. Feed the captured image to the classifier
4. Display the result after classification is complete
The parts I didn’t yet know how to do were 1 and 3. Furthermore, the Android camera API had changed, and many tutorials I could find were still written for the old, deprecated version. Luckily I did find one post that detailed everything that needs to be done. There is quite a bit of code needed to access the camera image, so I won’t show most of it here. Check the repository for the full code.
Preparing for image capture requires getting a CameraManager, asking the user for permission to use the camera (unless already granted), and getting a reference to a TextureView that will contain the preview. A real app would need to prepare for the camera permission not being granted as well.
The camera cannot display anything until the TextureView is available, so the app may have to wait for that by setting a listener. Afterwards the app needs to find an appropriate camera (the back-facing one here) and open it.
The call to openCamera() passes in a state callback. When the camera is opened, this callback will be informed, and it can trigger creating a preview session with the camera so that the camera image is available when the user presses the evaluation button.
To figure out how to make the classifications, I briefly looked at the TensorFlow Lite guide. While it helped me get started, it was much less help with the actual details of the code. For the details, I went through the sample app that does image classification, and extracted the specific bits I needed from its very general code.
Initializing the classifier means loading the model into memory from the app’s assets, and creating the TensorFlow Lite Interpreter object out of that. The input for an image classifier is conveniently given as a ByteBuffer, and it’s so large that it makes sense to pre-allocate it during initialization.
When running the classifier on an image, it’s important to do so on a background thread. It takes a few seconds to run, and holding up the UI thread for that long is simply not acceptable. The first thing to do is to create the input in the correct format in the buffer: scale the image to 299x299 and run the min-max scaling to get the pixel values to the range [-1, 1], as expected by InceptionV3.
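The app does this step in Android code, but the arithmetic and byte layout can be illustrated with a short Python sketch. It assumes pixels arrive as packed ARGB integers (the format returned by Android’s Bitmap.getPixels), that the model takes 32-bit floats, and that the buffer is little-endian; the names scale and fill_input_buffer are hypothetical.

```python
import struct

IMAGE_SIZE = 299
BYTES_PER_FLOAT = 4
# 1 image x 299 x 299 pixels x 3 channels x 4 bytes = 1,072,812 bytes
BUFFER_SIZE = 1 * IMAGE_SIZE * IMAGE_SIZE * 3 * BYTES_PER_FLOAT


def scale(value):
    """Map a channel value from [0, 255] to [-1, 1]."""
    return value / 127.5 - 1.0


def fill_input_buffer(argb_pixels, buffer):
    """Write each pixel's scaled R, G, B values into buffer as float32."""
    offset = 0
    for pixel in argb_pixels:
        red = (pixel >> 16) & 0xFF
        green = (pixel >> 8) & 0xFF
        blue = pixel & 0xFF
        # "<3f" packs three little-endian float32 values (assumed byte order).
        struct.pack_into("<3f", buffer, offset,
                         scale(red), scale(green), scale(blue))
        offset += 3 * BYTES_PER_FLOAT
    return buffer
```

At roughly 1 MB per image, allocating the buffer once during initialization avoids repeating that allocation on every classification.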
And finally, the classifier can be run. The type of the run() method is not very helpful, as it just takes an Object as both input and output. As mentioned above, for image classification, ByteBuffer is a good input format. The output in this classification case is a one-element array, the element of which is an array of floats whose size is the number of classes. After the run, this array of floats will be filled with the probability of each class. So in my case, it is a two-element array, with the elements being the probabilities of “cute” and “not cute”.
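To make the output shape concrete, here is a small Python sketch of picking the most likely class from such a nested array. The function name best_class and the label order used in the example are assumptions.

```python
def best_class(output, labels):
    """Return (label, probability) for the most likely class.

    output mirrors the interpreter's output: a one-element list that
    holds one probability per class.
    """
    probabilities = output[0]
    index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[index], probabilities[index]
```

For example, best_class([[0.92, 0.08]], ["cute", "not cute"]) picks "cute" with probability 0.92.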
I did not find a way to determine which class corresponds to which index in the array at this stage. So the class-to-index mapping would need to be extracted from the Keras model, and possibly included as data in the app, in case the mapping is not stable between different training runs.
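One possible way to pin the mapping down on the Keras side, offered as a sketch and not something verified in this post: the iterator returned by flow_from_directory exposes a class_indices dictionary, and the Keras documentation describes the inferred label order as alphanumeric by subdirectory name. The helpers below reproduce that ordering with the standard library and write it to a labels file that could be bundled with the app; the helper names and the file format are hypothetical.

```python
import os


def class_indices_from_dirs(train_dir):
    """Map each label subdirectory to an index, in sorted order."""
    labels = sorted(
        name for name in os.listdir(train_dir)
        if os.path.isdir(os.path.join(train_dir, name))
    )
    return {label: index for index, label in enumerate(labels)}


def write_labels_file(class_indices, path):
    """Write one label per line, ordered by index."""
    ordered = sorted(class_indices, key=class_indices.get)
    with open(path, "w") as f:
        f.write("\n".join(ordered) + "\n")
```

With the directories “cute” and “notcute”, this gives index 0 for “cute” and index 1 for “notcute”.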
In the end, it was not too difficult to build an image classifier app. It cuts across a few areas, so the information is not all in one place, but when you look, it’s all available somewhere. The app I built is of course only a demo, but knowing the principles and methods now, I am confident that I could add this kind of functionality to a real app too.
Yep, I think my model works well.