In theory, a deeper network should perform no worse than a shallower one, because we can construct the deeper network by copying the shallower one and stacking identity layers on top.
Conjecture: multiple non-linear layers have difficulty learning the identity function.
Due to how neural nets are designed and trained, it is easier to learn near-zero weights than to learn identity weights.
The “optimal” mapping is conjectured to be closer to the identity mapping than to the zero mapping. Hence, instead of learning the desired mapping H(x) directly, the stacked layers learn the residual F(x) = H(x) - x; recovering something close to the identity then only requires driving the weights toward zero, which is easier.
The “basic” design (left): add a shortcut connection that takes the input to a stack of two layers and adds it directly to their output. There must be at least two layers in between; with only one layer the block reduces to y = Wx + x, which is essentially a single linear layer.
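A minimal sketch of such a basic block, assuming PyTorch; the channel count and layer names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 conv layers with an identity shortcut skipping over both."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + x          # identity shortcut: add the input directly
        return self.relu(out)

# Sanity check: output shape equals input shape.
x = torch.randn(1, 64, 56, 56)
print(BasicBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```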
The “bottleneck” design (right): 1x1 conv layers reduce and then restore the channel dimension, leaving the 3x3 conv layer to operate on fewer channels.
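A minimal sketch of the bottleneck variant under the same PyTorch assumption; the 4x channel expansion follows the paper, the other details are illustrative.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    expansion = 4  # output channels = 4 * mid_channels, as in the paper

    def __init__(self, in_channels, mid_channels):
        super().__init__()
        out_channels = mid_channels * self.expansion
        # 1x1 reduces channels, 3x3 works on the reduced width, 1x1 restores it.
        self.reduce = nn.Conv2d(in_channels, mid_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv = nn.Conv2d(mid_channels, mid_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.restore = nn.Conv2d(mid_channels, out_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.reduce(x)))
        out = self.relu(self.bn2(self.conv(out)))
        out = self.bn3(self.restore(out))
        return self.relu(out + x)  # identity shortcut; shapes assumed to match
```

For the identity add to be valid, the input must already have `4 * mid_channels` channels, e.g. `Bottleneck(256, 64)`; otherwise a projection shortcut is needed (next point).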
When F(x) and x have different shapes, there are a few options, including padding with zeros or applying a projection matrix (e.g., a 1x1 convolution). See details in the paper.
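A hedged sketch of the projection option, again assuming PyTorch: a strided 1x1 convolution plus batch norm stands in for the paper's projection matrix; `make_shortcut` is a hypothetical helper name, not an API from any library.

```python
import torch
import torch.nn as nn

def make_shortcut(in_channels, out_channels, stride):
    # When shapes already match, keep the plain identity shortcut.
    if stride == 1 and in_channels == out_channels:
        return nn.Identity()
    # Projection shortcut: match both spatial size and channel count.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
        nn.BatchNorm2d(out_channels),
    )

# Usage inside a block's forward pass: out = F(x) + shortcut(x)
shortcut = make_shortcut(64, 128, stride=2)
x = torch.randn(1, 64, 56, 56)
print(shortcut(x).shape)  # torch.Size([1, 128, 28, 28])
```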
Even with 152 layers, ResNet-152 still has fewer FLOPs than VGG-16/19 (15.3/19.6 billion).
Going from ResNet-34 to ResNet-152 yields a considerable improvement (28.54 -> 21.43 error), a gain from added depth that was not achievable previously.