With the similarity scoring function described above, the authors used a very simple tracking framework in their papers:

1. Calculate the embedding $\phi(z)$ of the example image $z$, cropped from the first frame of the video.
2. Slide a window over the test image (each subsequent frame) and calculate the embedding of each window. That is the conceptual view; in practice it is implemented as a fully-convolutional neural network.
3. Calculate the cross-correlation between $\phi(z)$ and $\phi(x)$ for every window $x$. Again conceptually; in practice it is implemented as a convolution layer.
4. The output of the above step is a 2D score map. It can be intuitively interpreted as the score that the example object appears at each position of the test image.
5. Find the highest score and back-track its position in the test image (see the sketch after this list).
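Concretely, steps 3–5 amount to only a few lines of code. Below is a minimal sketch in PyTorch (not the authors' implementation); `phi` stands in for any embedding network, and the crop sizes, score-map size, and stride of 8 in the comments are assumptions borrowed from a typical SiamFC-style setup.

```python
import torch
import torch.nn.functional as F

def score_map(phi, exemplar, search):
    """Cross-correlate the exemplar embedding with the search-image embedding."""
    z = phi(exemplar)      # e.g. (1, C, 6, 6) for a 127x127 exemplar crop
    x = phi(search)        # e.g. (1, C, 22, 22) for a 255x255 search crop
    # PyTorch's conv2d is in fact cross-correlation, so using phi(z) as the
    # kernel scores phi(z) against every window of the search embedding.
    return F.conv2d(x, z)  # e.g. (1, 1, 17, 17) score map

def locate(scores, stride=8):
    """Back-track the highest-scoring cell to search-crop pixel coordinates."""
    s = scores.squeeze()   # (17, 17)
    row, col = divmod(int(torch.argmax(s)), s.shape[1])
    # Each score-map cell corresponds to `stride` pixels in the search crop.
    return row * stride, col * stride
```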
This vanilla framework doesn’t maintain state, doesn’t remember history, and doesn’t improve itself.
Note: it is definitely possible to use a more sophisticated tracking framework. It is just that this simple one works amazingly well.
Adapting the tracker to the current video as detection/tracking goes on is an attractive capability. The authors put this idea to work in the CVPR’17 paper by inserting a component that is updated after every frame during tracking (see figure below). Please refer to the paper for the mathematical details and the challenge of making it end-to-end trainable.
The bottom line is: (1) this model is updated at test time to capture the specifics of the current object of interest; (2) it is fast enough to run in real time.
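For intuition only, here is a hypothetical sketch of what “a component that is updated after every frame” could look like. This is not the paper’s correlation-filter formulation, just the simplest possible substitute: an exponential moving average of the template embedding, with an assumed update rate.

```python
import torch

class AdaptiveTemplate:
    """Hypothetical test-time adaptation via a running-average template."""

    def __init__(self, phi, exemplar, rate=0.01):
        self.phi = phi
        self.rate = rate                   # small rate = conservative updates
        with torch.no_grad():
            self.template = phi(exemplar)  # phi(z) from the first frame

    def update(self, crop):
        # After a confident detection, fold its embedding into the template
        # so the tracker adapts to the current object's appearance.
        with torch.no_grad():
            z_new = self.phi(crop)
            self.template = (1 - self.rate) * self.template + self.rate * z_new
```

Cheap per-frame updates like this are what keep test-time adaptation compatible with real-time speed; the actual paper achieves it with a correlation-filter layer that stays end-to-end trainable.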
To train a Siamese network, we need to sample anchor, positive, and negative examples from the training set and train with a triplet loss.
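As an illustration, the triplet loss itself is only a few lines. This sketch assumes embeddings flattened to vectors and compared with Euclidean distance; the margin value is arbitrary. (PyTorch also provides `torch.nn.TripletMarginLoss` out of the box.)

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    """anchor/positive/negative: (N, D) batches of flattened embeddings."""
    d_pos = F.pairwise_distance(anchor, positive)  # (N,)
    d_neg = F.pairwise_distance(anchor, negative)  # (N,)
    # Require the negative to be at least `margin` farther from the anchor
    # than the positive; the loss is zero once a triplet satisfies this.
    return F.relu(d_pos - d_neg + margin).mean()
```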
Some useful tricks to note:
- Use “hard negatives” instead of purely random samples for training. This improves the network more efficiently and makes training converge much faster (see the sketch after this list).
- Sample the anchor and positive from the same video and within a time window $T$ – yes, this is a bias, but a reasonable one to have!
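A sketch combining both tricks, under assumed data structures: `videos` is a list of videos, each a list of pre-cropped frame tensors of a single object, and `embed` maps a batch of crops to flattened embeddings. `T` and `n_candidates` are illustrative values.

```python
import random
import torch

def sample_triplet(videos, embed, T=100, n_candidates=32):
    # Anchor and positive come from the SAME video, at most T frames apart,
    # so they show the same object under modest appearance change.
    vid = random.choice(videos)
    i = random.randrange(len(vid))
    j = random.randrange(max(0, i - T), min(len(vid), i + T + 1))
    anchor, positive = vid[i], vid[j]

    # Hard negative: among random crops from OTHER videos, keep the one whose
    # embedding is currently closest to the anchor's -- the most confusing one.
    others = [v for v in videos if v is not vid]
    candidates = torch.stack(
        [random.choice(random.choice(others)) for _ in range(n_candidates)]
    )
    with torch.no_grad():
        a = embed(anchor.unsqueeze(0))        # (1, D)
        c = embed(candidates)                 # (n_candidates, D)
        hardest = torch.cdist(a, c).argmin()  # smallest distance = hardest
    return anchor, positive, candidates[hardest]
```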