Problems with SGD

1. Slow progress when the loss is much steeper in some directions than in others (a poorly conditioned loss surface): updates zig-zag along the steep directions and crawl along the flat ones

2. Local minima and saddle points, where the gradient is zero and the updates stall

→ Saddle points are much more common than local minima in high dimensions

3. Estimating the loss and its gradient from a minibatch makes the updates noisy (for reference, a plain SGD loop is sketched after this list)

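For comparison with the variants below, here is a minimal sketch of the plain SGD loop in the same pseudocode style as the AdaGrad snippet; compute_gradient, learning_rate, and x are assumed placeholders, with compute_gradient returning a noisy minibatch gradient.

while True:
    dx = compute_gradient(x)   # noisy minibatch estimate of the gradient
    x -= learning_rate * dx    # step directly along that noisy estimate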

SGD+Momentum

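A minimal sketch of the update, assuming the same placeholders as the other snippets plus a friction/decay hyperparameter rho (commonly around 0.9):

v = 0
while True:
    dx = compute_gradient(x)
    v = rho * v + dx           # running "velocity": a decaying sum of past gradients
    x -= learning_rate * v     # step along the velocity rather than the raw gradient

The velocity carries the updates through flat regions and saddle points and averages out some of the minibatch noise.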

Nesterov Momentum

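A common rearranged form of the Nesterov update (the "look-ahead" gradient rewritten in terms of a single variable), again assuming rho and learning_rate; note that the velocity here folds in the learning rate:

v = 0
while True:
    dx = compute_gradient(x)
    old_v = v
    v = rho * v - learning_rate * dx     # velocity update, with the learning rate folded in
    x += -rho * old_v + (1 + rho) * v    # correction term reproduces the look-ahead step

Conceptually, Nesterov momentum evaluates the gradient at the point the current velocity is about to carry it to, rather than at the current point, so the velocity can correct itself one step earlier than standard momentum.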

AdaGrad

import numpy as np

grad_squared = 0
while True:
    dx = compute_gradient(x)                 # minibatch gradient, as in plain SGD
    grad_squared += dx * dx                  # per-parameter running sum of squared gradients
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)  # scale each parameter's step individually

What is happening with AdaGrad?

→ Progress along “steep” directions is damped, and progress along “flat” directions is accelerated

What happens to the step size over a long time?

→ It decays toward zero: grad_squared only ever grows, so the denominator keeps increasing
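
A quick illustration of the decay, assuming a constant gradient of 1.0 purely to isolate the effect: because grad_squared only grows, the effective step shrinks like 1/sqrt(t).

import numpy as np

grad_squared = 0.0
for t in range(1, 10001):
    dx = 1.0                                            # pretend the gradient stays constant
    grad_squared += dx * dx
    step = 1e-2 * dx / (np.sqrt(grad_squared) + 1e-7)   # effective step size at iteration t
    if t in (1, 100, 10000):
        print(t, step)                                  # roughly 1e-2, 1e-3, 1e-4: shrinking toward zero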