If I showed you the two images above posed the classic image-search game of Where's Waldo to you, you more likely than not are first going to glance at Waldo to make sure you have an idea of what He looks like, and then will set about searching in the left image to find him. This idea is exactly the same premise upon which the Siamese Neural Network was applied to the task of object-finding or localisation. But before we get to there, we need some context on how this idea came about
The Siamese network architecture was first proposed in the early 1990s. At the time, the idea of convolutional neural networks (CNNs) for image recognition had just begun to gain traction, CNNs began becoming popular for a wide range of image-related applications. However, it wasn't until the 2000s that CNNs became the forefront of image-processing application research (Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review) owing to the lack of processing power and the existence of sufficiently large datasets, which highlights two of the main issues faced with CNNs: size of data and computational complexity
CNNs require a LOT of data, and I mean, a LOT of data. (If data were cookies, neural networks would be a King-Kong-sized cookie monster). The exact number of instances needed to train a network is never a fixed number, and varies widely depending on application and required outputs. (However, there are still efforts being made to relate the size of a system to the amount of data, see How Much Data is Enough? A Statistical Approach with Case Study on Longitudinal Driving Behavior
The second issue is the computational complexity of a network. Dense layers in neural networks are essentially a bunch of non-linear functions, which is trained to estimate patterns in input data; slap convolutions at the base of your neural network, and the computational complexity goes up by a constant factor of the product of your image dimensions (two-dimensional image in the worst case goes up by MxN computations for each pixel per layer). The TLDR is that, neural networks need a lot of horsepower to train, and according to this MIT Paper, a significant amount of processing power to run inference.
The Siamese network was proposed in an attempt to alleviate at least one of the above issues for the purposes of verifying signatures for the postal service (a form of image recognition). With a Siamese network, the target image (to be recognized) and the query image (where our target supposedly lives, and we are seeking to find our target within) is presented simultaneously to the network at time of inference.
The idea was that taking two identically pretrained networks and having both process two images at the same time, would result in some form of embedded output that was directly comparable.
See below for the original schematic from the 1993 paper, which was the earliest source I could find for the idea.
Remember, a convolutional neural network with its top-most layer removed is essentially an overpowered feature extractor. The analogy of a Siamese network is showing someone the colour red (our target), and then telling them to pick out a red card from a set of cards (our potential query images)
While this may seem like double the hassle of attempting to run inference on a single network (2xTrouble > 1xTrouble?), the proposed benefit was that only a single instance of the target image need be available at any point in time. This eliminated the need to have gigantic sets of image data for identifying small classes of images. In theory, the number of classes of images that could be recognized (or number of signatures that could be verified) was limited only by the number of target images available, and computational complexity at inference time
The above application of Siamese networks describes this two-stream network approach to object recognition A more recent application of the Siamese network was proposed for object-localization (instead of saying whether our query matches our target, we instead say where in our query our target may be located)
From our initial Waldo problem, our query image is the gigantic canvas of brightly coloured 'stuff' on the left, and our target is Waldo
One of the first papers simply added a selective search layer just before the output, which generated a set of potential bounding boxes in the query image known as object proposals; if the target image matched any of the proposed bounding boxes, the output of the entire network would be the coordinates of the bounding box within which the target was found (in other terms, the location of the target in the query)
An additional innovation was the aggressive use of max-pooling layers for the purposes of reducing resolution in the convolutional stages. (A max-pooling layer takes a small grid of pixels, usually 3x3, and returns only the pixel with the highest value, thereby a 3x3 block becomes 1x1). Whilst this may seem counter-intuitive, this technique was shown to propagate areas only with the most likely matches through the network, and increased the chances of the network at finding the target image in the query
A more recent paper married the idea of a Region-Proposal Neural Network (RPNN) and the Siamese Network in an attempt to reduce the computational burden of the selective search method above. The idea was that a network 7 layers deep (5 convolutional and 2 fully-connected) would be used as the feature-extracting layers in the Siamese network. The outputs of the two individual pathways (a siamese network is two neural networks side-by-side), are the fed to the region-proposal layer, which uses the cross-correlation function to determine similarities between two images (viz. signals)
This idea is motivated by Euclidean Distance where the new, cross-correlated image is a function of the two original images (from either side of the network, one corresponding to the target and the other, the query)
The above essentially says to sum the differences over the x and y axes between the query image f and the target image or feature t at position (u,v). However, an issue was found where very bright locations would result in abnormally high values in the output image, even if the target feature was not present in the query. The proposed solution was normalized cross-correlation, which additionall has the benefit of imposing invariance w.r.t lighting conditions and feature size. This new function
essentially pushes the target and query means to zero, and scales using the product of the standard deviations of their features. In order to fuse multi-channel score mappings into a single output image, a single 1x1 convolution is used.
The below shows a Diagram from a 2018 IEEE paper, which details how the different convolutional layers contribute to the score-map or heat-map for the purposes of bounding-box region proposal.
The bright red spot is where the Siamese network thinks the target is in the search region
Up to this point, I didn't get into the details of the loss function for the Siamese network for one reason: it depends heavily on application. What you measure to tell whether an image is of the same class as another (object recognition), is different from what you measure to tell where a target image is located in your query (object localization) .
For the specific purposes of Region-Proposal Object Tracking, the loss function used is:
In the above, y is a ground-truth label (+1 for whether the bounding-box found actually matches the target image, and -1 for a bounding box that doesn't match), and v is the actual real-valued score produced by the Siamese network on whether or not the box found in the query is actually the target. It is the mean of this value that is used as the net loss for the network giving a prediction.
In short, the network is penalised for predicting bounding boxes in the query image that do not match the image specified as its target (In most testing, these networks are usually trained on a balanced set of positive and negative samples), where a larger loss is as a result of a wrong/negative proposal
The Siamese neural network is essentially two networks in parallel working to generate outputs which can, in theory, be directly compared. It transforms both inputs into a common feature space, for the purpose of minimizing a loss function to achieve some task, whether it be object localization, recognition, etc.
This blog post sought to give a brief, but thorough introduction into the history and evolution of this incredibly cool neural network architecture. Hope you learnt something!