AlexNet

VGGNet

- Stack of three 3×3 conv (stride 1) layers vs. one 7×7 conv layer
- Same effective receptive field as a single 7×7 conv layer
- Fewer parameters: 3 · (3²C²) = 27C² vs. 49C² for C channels per layer
- Deeper, with more non-linearities (see the sketch below)
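A minimal sketch of the parameter comparison (PyTorch assumed; the channel count `C` is a hypothetical choice, with input depth equal to output depth):

```python
import torch.nn as nn

C = 64  # hypothetical channel count (input depth == output depth)

# Option A: a single 7x7 conv layer.
seven = nn.Conv2d(C, C, kernel_size=7, padding=3)

# Option B: three stacked 3x3 convs -- same 7x7 effective receptive
# field, but with a ReLU between each pair, i.e. more non-linearities.
three = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1),
)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print(n_params(seven))  # 49*C^2 + C    = 200,768
print(n_params(three))  # 3*(9*C^2 + C) = 110,784
```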
GoogLeNet

- Problem
    - Total ~854M ops → very expensive to compute
    - The pooling branch preserves feature depth, so the concatenated output depth can only grow at every layer
- Solution
    - “Bottleneck” layers that use 1×1 convolutions to reduce feature depth before the expensive convs (see the sketch below)
    - Total reduced to 358M ops
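A sketch of the bottleneck idea in an Inception-style module (PyTorch assumed; the branch widths below are illustrative, not GoogLeNet's exact configuration). Each expensive 3×3/5×5 conv is preceded by a 1×1 conv that shrinks the depth, and a 1×1 conv after the pooling branch keeps the concatenated depth from growing unchecked:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    # Illustrative channel counts; output depth = 64 + 128 + 96 + 64 = 352.
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=1),           # 1x1 bottleneck: shrink depth
            nn.Conv2d(64, 128, kernel_size=3, padding=1),  # cheaper 3x3 on reduced depth
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=1),           # 1x1 bottleneck: shrink depth
            nn.Conv2d(64, 96, kernel_size=5, padding=2),   # cheaper 5x5 on reduced depth
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),          # pooling preserves depth...
            nn.Conv2d(in_ch, 64, kernel_size=1),           # ...so a 1x1 conv caps it
        )

    def forward(self, x):
        return torch.cat(
            [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)],
            dim=1,
        )
```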
ResNet
- What happens when we continue to stack deeper layers on a “plain” convolutional network?
- The deeper model performs worse than the shallower model, and it’s not because of overfitting, since the training error is also high

- Deep models have more representational power, but they seem to be harder to optimize → how can we make deep models perform at least as well as shallow models?
- Solution: copy the learned layers from the shallower model and set the additional layers to identity mappings → ResNet makes the identity easy to learn via residual blocks (see the sketch below)
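A minimal sketch of a residual block (PyTorch assumed): the stacked layers learn a residual F(x) and the block outputs F(x) + x, so recovering the identity mapping only requires driving F toward zero, which is easier to optimize than learning the identity from scratch:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # Two 3x3 conv layers learn the residual F(x).
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(f + x)  # skip connection: output is F(x) + x
```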
