ECCVw’16 Fully-Convolutional Siamese Networks for Object Tracking

CVPR’17 End-to-end representation learning for Correlation Filter based tracking


These two papers present an algorithm for real-time object tracking. In particular, it has the following advantages:

  • It runs faster than real time (50 - 80 fps)

  • You can start tracking by giving only the bounding box in the first frame of a video.

  • It uses a CNN. Sounds cool?

  • In particular, it uses a fully-convolutional network (no FC layers), so it can handle test images of any size.

  • It is class-agnostic, and requires no class-specific training data.

  • Using a pre-trained network, you basically don’t need to do any re-training and can simply use the pre-trained model in inference mode for your tasks/objects.

  • It is end-to-end trainable.

The following holds for the ECCVw’16 paper:

  • During tracking, only inference happens, no training/fine-tuning. That’s one reason why it’s fast.

  • It maintains no internal state.

The following holds for the CVPR’17 paper:

  • It adapts to video-specific features every frame as tracking goes on (so it does maintain internal state and fine-tunes at test time), and it still runs at real-time speed.

As far as I understand, it has some limitations:

  • The detection position is interpolated from a lower-resolution score map, so it is not highly accurate (but good enough).

  • Because there is no box regression (which is partly why it is fast), the bounding boxes are usually not highly accurate or tight.

How does it work?

Siamese network and cross-correlation

If you are unfamiliar with Siamese networks and triplet loss, you can learn some basics through this video by Andrew Ng and the one following it.

It uses a Siamese network to construct what is essentially a class-agnostic similarity scoring function between two image patches. The Siamese network transforms an image patch (e.g., of the target object $z$) into an embedding (e.g., a 6x6x128 feature map). The embeddings of two image patches are then used to calculate their cross-correlation. An ideal transformation generates high cross-correlation between images of the same object and low cross-correlation between images of different objects.

Basically, the authors showed that, given a sufficiently large amount of training data, a shallow (and therefore fast at inference) Siamese network can learn a transformation that effectively separates same/different objects and generalizes well to unseen classes (thus class-agnostic). Note that the papers train on exemplar/search pairs with a logistic loss over the score map rather than with a triplet loss.
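To make the scoring function concrete, here is a minimal sketch of the similarity score for two embeddings of the same size. In that case the cross-correlation reduces to a single number, the sum of element-wise products. The embeddings below are hand-made toy values flattened to short lists (an assumption for illustration), not outputs of the actual network, and the name `similarity` is hypothetical.

```python
def similarity(emb_a, emb_b):
    """Cross-correlation of two same-sized embeddings at zero offset:
    the sum of element-wise products (an inner product)."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same number of elements")
    return sum(a * b for a, b in zip(emb_a, emb_b))


# Toy embeddings (hypothetical values, flattened to 1-D lists).
target = [1.0, 0.0, 2.0, -1.0]
same_object = [0.9, 0.1, 1.8, -1.1]    # close to `target`
other_object = [-1.0, 2.0, 0.0, 1.0]   # points in a different direction

score_same = similarity(target, same_object)
score_other = similarity(target, other_object)
print(score_same > score_other)
```

A well-trained embedding should behave like the toy values here: the pair showing the same object scores higher than the pair showing different objects.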


Very simple tracking framework (a.k.a. sliding window)

With the similarity scoring function described above, the authors used a very simple tracking framework in their papers.

  1. Calculate the embedding of the example image $\phi(z)$ – from the first frame of the video.

  2. Slide a window over the test image (each subsequent frame) and calculate the embedding of each window. This is only conceptual: in practice it is implemented as a fully-convolutional neural network applied once to the whole test image.

  3. Calculate the cross-correlation between $\phi(z)$ and the embedding of every window. Again, this is conceptual: in practice it is implemented as a convolution layer.

  4. The output of the above step is a 2D scoring map. It can be intuitively interpreted as the score that the example object appears at each position of the test image.

  5. Just find the highest score and map its position back onto the test image.
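The five steps above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the "embeddings" are small single-channel 2-D lists rather than multi-channel CNN features, the helper names are hypothetical, and the stride of 8 used to map score-map positions back to pixels is an assumption made for the example.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` (both 2-D lists) and return the score map."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    scores = []
    for top in range(sh - th + 1):
        row = []
        for left in range(sw - tw + 1):
            row.append(sum(
                template[i][j] * search[top + i][left + j]
                for i in range(th) for j in range(tw)
            ))
        scores.append(row)
    return scores


def argmax_2d(scores):
    """Return (row, col) of the highest score (first one wins on ties)."""
    best_rc, best_val = (0, 0), scores[0][0]
    for r, row in enumerate(scores):
        for c, val in enumerate(row):
            if val > best_val:
                best_rc, best_val = (r, c), val
    return best_rc


def to_image_offset(peak, score_shape, stride):
    """Convert a score-map peak into a pixel displacement from the map center."""
    center_r, center_c = score_shape[0] // 2, score_shape[1] // 2
    return (peak[0] - center_r) * stride, (peak[1] - center_c) * stride


# Step 1: the exemplar embedding (toy values).
template = [[1, 2], [3, 4]]

# Steps 2-3: a 6x6 search embedding containing the exemplar at row 3, col 1.
search = [[0] * 6 for _ in range(6)]
for i in range(2):
    for j in range(2):
        search[3 + i][1 + j] = template[i][j]

scores = cross_correlate(search, template)       # step 4: 5x5 score map
peak = argmax_2d(scores)                         # step 5: highest score
offset = to_image_offset(peak, (len(scores), len(scores[0])), stride=8)
print(peak, offset)
```

Because the score map is coarser than the image, the recovered displacement is only accurate up to the stride, which is the source of the localization limitation mentioned earlier.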

This vanilla framework doesn’t maintain states, doesn’t remember history, and doesn’t improve itself.

Note: it is definitely possible to use a more sophisticated tracking framework. It is just that this simple framework works amazingly well.

Learning video-specific features

Adapting the tracker to the current video as tracking goes on is an attractive property. The authors made this work in the CVPR’17 paper by inserting a component that is updated after every frame during tracking (see figure below). Please refer to the paper for the math details and the challenge of making it end-to-end trainable.

The bottom line is: (1) this model is updated at test time to capture the specifics of the current object of interest; (2) it is fast enough to run in real time.
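For intuition about why such a per-frame update can be cheap, here is a minimal sketch of a generic correlation filter: ridge regression over all cyclic shifts of a signal, which has a closed-form solution per frequency in the Fourier domain. This is a 1-D, single-channel toy written under my own assumptions (the names `fit_filter` and `respond`, the signals, and the regularizer `lam` are all hypothetical); it is not the layer from the paper, which works on multi-channel CNN features and is trained end to end.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (fine for tiny toy signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def fit_filter(x, y, lam):
    """Ridge regression over all cyclic shifts of x, solved per frequency.
    Returns the filter in the Fourier domain."""
    X, Y = dft(x), dft(y)
    return [Xk * Yk.conjugate() / (abs(Xk) ** 2 + lam) for Xk, Yk in zip(X, Y)]


def respond(W, x):
    """Circular cross-correlation of the learned filter with a signal x."""
    X = dft(x)
    return [v.real for v in idft([Wk.conjugate() * Xk for Wk, Xk in zip(W, X)])]


# Toy 1-D "appearance" and the desired response (peak at zero shift).
x = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
y = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5]

W = fit_filter(x, y, lam=1e-3)       # "training" is one closed-form solve
response_train = respond(W, x)

# The same pattern moved 3 samples to the right: the peak should move too.
x_shifted = x[-3:] + x[:-3]
response_shifted = respond(W, x_shifted)
estimated_shift = response_shifted.index(max(response_shifted))
print(estimated_shift)
```

Fitting the filter costs a couple of transforms and an element-wise division, with no gradient descent involved, which is why this kind of test-time update can stay within a real-time budget.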



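To train the Siamese network, we sample pairs of exemplar and search images from the training set. The papers label each position of the score map as positive (near the target center) or negative and train with a logistic loss, rather than with anchor/positive/negative triplets and a triplet loss.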

Some useful tricks to note:

  • Use “hard negatives” instead of purely random samples for training. This improves the network more efficiently and makes training much faster.

  • Sample from the same video and within a time window T – yes this is bias, but a reasonable one to have!
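The second trick can be sketched as a small sampler. This is an illustration under my own assumptions: `video_lengths` is a hypothetical list holding the number of frames in each training video, `max_gap` plays the role of T, and the function returns frame indices only (cropping the actual image patches is left out).

```python
import random


def sample_pair(video_lengths, max_gap, rng):
    """Pick two frame indices from one video, at most `max_gap` frames apart."""
    video = rng.randrange(len(video_lengths))
    n = video_lengths[video]
    first = rng.randrange(n)
    lo = max(0, first - max_gap)
    hi = min(n - 1, first + max_gap)
    second = rng.randint(lo, hi)   # inclusive bounds; may equal `first`
    return video, first, second


rng = random.Random(0)
video_lengths = [120, 45, 300]     # hypothetical frame counts
pairs = [sample_pair(video_lengths, max_gap=100, rng=rng) for _ in range(1000)]
print(pairs[0])
```

Both frames always come from the same video, so the pair shows the same object under moderate appearance change, which is exactly the bias the bullet above refers to.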